


PTO/SB/17 (01-03)
Approved for use through 10/31/2002. OMB 0651-0032
U.S. Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE

Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of
information unless it displays a valid OMB control number.

FEE TRANSMITTAL
for FY 2003

Patent fees are subject to annual revision.

[X] Applicant claims small entity status. See 37 CFR 1.27

TOTAL AMOUNT OF PAYMENT ($) 165.00



Complete if Known

Application Number:    09/916,122
Filing Date:           07/26/01
First Named Inventor:  Friddle
Examiner Name:         J. Ulm
Group Art Unit:        1646
Attorney Docket No.:   LEX-0206-USA



METHOD OF PAYMENT (check all that apply)

[ ] Check   [ ] Credit Card   [ ] Other   [ ] None

[X] Deposit Account

Deposit Account Number:  50-0892
Deposit Account Name:    Lexicon Genetics Incorporated

The Commissioner is authorized to: (check all that apply)
[X] Charge fee(s) indicated below
[mark illegible] Credit any overpayments
[X] Charge any additional fee(s) during the pendency of this application
[ ] Charge fee(s) indicated below, except for the filing fee

to the above-identified deposit account.



FEE CALCULATION

1. BASIC FILING FEE

Large Entity     Small Entity
Fee    Fee       Fee    Fee
Code   ($)       Code   ($)      Fee Description                      Fee Paid
1001   770       2001   385      Utility filing fee
1002   340       2002   170      Design filing fee
1003   530       2003   265      Plant filing fee
1004   770       2004   385      Reissue filing fee
1005   160       2005   80       Provisional filing fee

SUBTOTAL (1) ($)

2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE

                      Extra Claims     Fee from below     Fee Paid
Total Claims          -20** =
Independent Claims    -3** =
Multiple Dependent

Large Entity     Small Entity
Fee    Fee       Fee    Fee
Code   ($)       Code   ($)      Fee Description
1202   18        2202   9        Claims in excess of 20
1201   86        2201   43       Independent claims in excess of 3
1203   290       2203   145      Multiple dependent claim, if not paid
1204   [not legible]  2204   43  ** Reissue independent claims over original patent
1205   18        2205   9        ** Reissue claims in excess of 20 and over original patent

SUBTOTAL (2) ($)

** or number previously paid, if greater; For Reissues, see above

FEE CALCULATION (continued)

3. ADDITIONAL FEES

Large Entity     Small Entity
Fee    Fee       Fee    Fee
Code   ($)       Code   ($)      Fee Description
1051   130       2051   65       Surcharge - late filing fee or oath
1052   50        2052   25       Surcharge - late provisional filing fee or cover sheet
1053   130       1053   130      Non-English specification
1812   2,520     1812   2,520    For filing a request for ex parte reexamination
1804   920*      1804   920*     Requesting publication of SIR prior to Examiner action
1805   1,840*    1805   1,840*   Requesting publication of SIR after Examiner action
1251   110       2251   55       Extension for reply within first month
1252   420       2252   210      Extension for reply within second month
1253   950       2253   475      Extension for reply within third month
1254   1,480     2254   740      Extension for reply within fourth month
1255   2,010     2255   1,005    Extension for reply within fifth month
1401   330       2401   165      Notice of Appeal
1402   330       2402   165      Filing a brief in support of an appeal
1403   290       2403   145      Request for oral hearing
1451   1,510     1451   1,510    Petition to institute a public use proceeding
1452   110       2452   55       Petition to revive - unavoidable
1453   1,330     2453   665      Petition to revive - unintentional
1501   1,330     2501   665      Utility issue fee (or reissue)
1502   480       2502   240      Design issue fee
1503   640       2503   320      Plant issue fee
1460   130       1460   130      Petitions to the Commissioner
1807   50        1807   50       Processing fee under 37 CFR 1.17(q)
1806   180       1806   180      Submission of Information Disclosure Stmt
8021   40        8021   40       Recording each patent assignment per property
                                 (times number of properties)
1809   770       2809   385      Filing a submission after final rejection
                                 (37 CFR § 1.129(a))
1810   770       2810   385      For each additional invention to be
                                 examined (37 CFR § 1.129(b))
1801   770       2801   385      Request for Continued Examination (RCE)
1802   900       1802   900      Request for expedited examination
                                 of a design application

Other fee (specify)

Fee Paid: 165.00

* Reduced by Basic Filing Fee Paid

SUBTOTAL (3) ($) 165.00



SUBMITTED BY                                        Complete (if applicable)

Name (Print/Type):                   Lance K. Ishimoto
Registration No. (Attorney/Agent):   41,[remaining digits illegible]
Telephone:                           (281) 863-3333
Signature:                           [signature illegible]
Date:                                December 29, 2003

Customer # 24231

WARNING: Information on this form may become public. Credit card information should not be
included on this form. Provide credit card information and authorization on PTO-2038.

Burden Hour Statement: This form is estimated to take 0.2 hours to complete. Time will vary depending
upon the needs of the individual case. Any comments on the amount of time you are required to complete
this form should be sent to the Chief Information Officer, U.S. Patent and Trademark Office, Washington,
DC 20231. DO NOT SEND FEES OR COMPLETED FORMS TO THIS ADDRESS. SEND TO: Assistant Commissioner for
Patents [remainder illegible]



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

Application of: Friddle et al. 

Serial No.: 09/916,122 Group Art Unit: 1646 

Filed: 07/26/2001 Examiner: J. Ulm 

For: Novel Human 7TM Protein and Polynucleotides Encoding the Same

Attorney Docket No.: LEX-0206-USA



APPEAL BRIEF 



Mail Stop Appeal Brief - Patents 

Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



TABLE OF CONTENTS

I. REAL PARTY IN INTEREST

II. RELATED APPEALS AND INTERFERENCES

III. STATUS OF THE CLAIMS

IV. STATUS OF THE AMENDMENTS

V. SUMMARY OF THE INVENTION

VI. ISSUES ON APPEAL

VII. GROUPING OF THE CLAIMS

VIII. ARGUMENT

A. Do Claims 1-5 Lack a Patentable Utility?

B. Are Claims 1-5 Unusable Due to a Lack of Patentable Utility?

IX. APPENDIX

X. CONCLUSION



TABLE OF AUTHORITIES

CASES

Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 927 F.2d 1200, 18 USPQ2d 1016 (Fed. Cir. 1991)

Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed.
Cir. 1992)

Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101 (Fed. Cir. 1991) (citing Envirotech Corp.
v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 1984))

Cross v. Iizuka, 753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985)

Diamond v. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S. 1980)

Hoffman v. Klaus, 9 USPQ2d 1657 (Bd. Pat. App. & Inter. 1988)

In re Angstadt and Griffin, 537 F.2d 498, 190 USPQ 214 (CCPA 1976)

In re Brana, 51 F.3d 1560, 34 USPQ2d 1436 (Fed. Cir. 1995)

In re Fouche, 439 F.2d 1237, 1243, 169 USPQ 429, 434 (CCPA 1971)

In re Gottlieb, 328 F.2d 1016, 140 USPQ 665 (CCPA 1964)

In re Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980)

In re Langer, 503 F.2d 1380, 183 USPQ 288 (CCPA 1974)

In re Malachowski, 530 F.2d 1402, 189 USPQ 432 (CCPA 1976)

In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA 1971)

In re Wands, 858 F.2d 731, 8 USPQ2d 1400 (Fed. Cir. 1988)

Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 1700 (Fed. Cir. 1999) (citing
Brenner v. Manson, 383 U.S. 519, 534 (1966))

Raytheon Co. v. Roper Corp., 724 F.2d 951, 220 USPQ 592 (Fed. Cir. 1983)

State Street Bank & Trust Co. v. Signature Financial Group Inc., 149 F.3d 1368, 47 USPQ2d
1596, 1600 (Fed. Cir. 1998)

STATUTES

35 U.S.C. § 101

35 U.S.C. § 102

35 U.S.C. § 112



DEC 29 2003

APPEAL BRIEF

Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent 
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on May 22, 2003. 
The Notice of Appeal was timely submitted on August 22, 2003, and was received in the Patent and 
Trademark Office ("the Office") on August 28, 2003. This Appeal Brief is timely submitted in light of the 
concurrently filed Petition for an Extension of Time of two months to and including December 28, 2003, 
which falls on a Sunday and is therefore extended until Monday, December 29, 2003 under 
37 C.F.R. § 1.7, and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(2) from 
Appellants' Representatives' deposit account. The Commissioner is also authorized to charge the fee for 
filing this Appeal Brief ($165.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics 
Incorporated Deposit Account No. 50-0892. 
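
The date arithmetic in the preceding paragraph can be checked with a short sketch. This is an illustration only, not part of the record: the function name `roll_forward` is hypothetical, and the sketch assumes that only Saturdays and Sundays move a deadline (federal holidays, which 37 C.F.R. § 1.7 also covers, are ignored here).

```python
from datetime import date, timedelta


def roll_forward(deadline: date) -> date:
    """Move a deadline that falls on a Saturday or Sunday to the next Monday.

    Simplifying assumption: federal holidays are not considered.
    """
    # date.weekday(): Monday == 0 ... Saturday == 5, Sunday == 6
    while deadline.weekday() >= 5:
        deadline += timedelta(days=1)
    return deadline


nominal = date(2003, 12, 28)        # end of the two-month extension
effective = roll_forward(nominal)   # first following business day
print(nominal.isoformat(), "->", effective.isoformat())
```

Under these assumptions the nominal due date of December 28, 2003 rolls forward to December 29, 2003, which matches the date stated above.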

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the 
extension of time are due in connection with this Appeal Brief. However, should any additional fees under 
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated 
Deposit Account No. 50-0892. 

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest 
Place, The Woodlands, Texas, 77381. 

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences that will directly affect or be directly 
affected by or have a bearing on the Board's decision in the pending appeal. 



III. STATUS OF THE CLAIMS 

The present application was filed on July 26, 2001, claiming the benefit of U.S. Provisional
Application Number 60/221,012, which was filed on July 27, 2000, and included original claims 1 and 2.
A First Official Action on the merits ("the First Action") was issued on October 1, 2002, in which claims 1
and 2 were rejected under 35 U.S.C. § 101 as allegedly lacking a patentable utility, claims 1 and 2 were
rejected under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the
alleged lack of patentable utility, claim 1 was rejected under 35 U.S.C. § 112, second paragraph, as
allegedly indefinite, and claim 1 was rejected under 35 U.S.C. § 102(a) as allegedly anticipated by
Bellenson et al. (WO 01/21158; "Bellenson"). In a response to the First Official Action submitted to the
Office on March 3, 2003 ("Response to the First Action"), Appellants amended claims 1 and 2, added
new claims 3-5, and addressed the various rejections of claims 1 and 2.

A Second and Final Official Action ("the Final Action") was issued on May 22, 2003, indicating
that the rejections of claim 1 under 35 U.S.C. § 112, second paragraph, as allegedly indefinite, and claim 1
under 35 U.S.C. § 102(a) as allegedly anticipated by Bellenson, had been overcome by the amendments
and remarks submitted in the Response to the First Action, but maintaining the rejections of claims 1 and 2
(and newly added claims 3-5) under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and under
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of
patentable utility. In a response to the Final Action submitted to the Office on August 22, 2003 ("Response
to the Final Action"), Appellants again addressed the rejections of claims 1-5.

An Advisory Action ("the Advisory Action") was mailed on October 10, 2003, maintaining the
rejections of claims 1-5 under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and under
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of
patentable utility. Therefore, claims 1-5 are the subject of this appeal. A copy of the appealed claims is
included below in the Appendix (Section IX).

IV. STATUS OF THE AMENDMENTS 

As no amendments subsequent to the Final Action have been filed, Appellants believe that no
outstanding amendments exist.



V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human 
polynucleotide sequences that encode a novel G protein-coupled receptor that spans the cellular membrane 
and is involved in signal transduction after ligand binding, and that has structural motifs found in the seven 
transmembrane domain (7TM) receptor family (specification at page 2, lines 9-13, and at page 4, 
lines 20-23). 

The presently claimed polynucleotide sequences were compiled from cDNA clones from human
adipose and testis cDNA libraries (specification at page 7, lines 13-14). Two coding single nucleotide
polymorphisms were identified in the claimed sequence - specifically, a T/G polymorphism at position 233
of SEQ ID NO:1, which can lead to a valine or glycine residue at amino acid position 78 of SEQ ID NO:2,
and a C/T polymorphism at position 316 of SEQ ID NO:1, which can lead to an arginine or cysteine
residue at amino acid position 106 of SEQ ID NO:2 (specification at page 7, lines 21-30).
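
The correspondence between the nucleotide positions and the amino acid positions recited above can be illustrated with a minimal sketch. It is offered only as an aid to the reader and rests on an assumption not stated in this brief: that the open reading frame begins at position 1 of SEQ ID NO:1. The helper names `codon_location` and `swap` are hypothetical; the codon sets are taken from the standard genetic code.

```python
from typing import Tuple


def codon_location(nt_position: int) -> Tuple[int, int]:
    """Map a 1-based coding-sequence position to (codon number, position in codon).

    Assumes the reading frame starts at nucleotide 1.
    """
    codon = (nt_position - 1) // 3 + 1
    offset = (nt_position - 1) % 3 + 1
    return codon, offset


def swap(codon: str, position: int, base: str) -> str:
    """Return the codon with the base at the given 1-based position replaced."""
    return codon[: position - 1] + base + codon[position:]


# Codons from the standard genetic code for the four residues named in the text
VAL = {"GTT", "GTC", "GTA", "GTG"}
GLY = {"GGT", "GGC", "GGA", "GGG"}
ARG = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}
CYS = {"TGT", "TGC"}

print(codon_location(233))  # position of the T/G polymorphism
print(codon_location(316))  # position of the C/T polymorphism

# A T-to-G change at the second codon position turns every valine codon into glycine
val_to_gly = all(swap(c, 2, "G") in GLY for c in VAL)

# A C-to-T change at the first codon position turns some arginine codons into cysteine
arg_to_cys = sorted(c for c in ARG if swap(c, 1, "T") in CYS)
```

Under the stated assumption, position 233 is the second base of codon 78 and position 316 is the first base of codon 106, which is consistent with the valine/glycine and arginine/cysteine alternatives described in the specification.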

The specification details a number of uses for the presently claimed polynucleotide sequences, 
including in diagnostic assays such as forensic analysis (see, for example, the specification at page 14, 
lines 5-8), in assessing gene expression patterns, particularly using a high throughput "chip" format (see, 
for example, the specification at page 9, lines 15-17), and in mapping a unique gene to a particular 
chromosome (see, for example, the specification at page 3, lines 36-37). 

VI. ISSUES ON APPEAL 

1. Do claims 1-5 lack a patentable utility? 

2. Are claims 1-5 unusable by a skilled artisan due to a lack of patentable utility? 

VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first
paragraph, associated with the utility rejection, the claims will stand or fall together.



VIII. ARGUMENT 

A. Do Claims 1-5 Lack a Patentable Utility? 

The Final Action first rejects claims 1-5 under 35 U.S.C. § 101, as allegedly lacking a patentable 
utility due to not being supported by either a specific and substantial or a well-established utility. 

Appellants pointed out both in the Response to the First Action and the Response to the Final
Action that the present nucleic acid sequences have utility in diagnostic assays, such as forensic analysis,
as described in the specification as originally filed (see, for example, page 14, lines 5-8). As described in
the specification on page 7, lines 21-30, the present sequences define two coding single nucleotide
polymorphisms - specifically, a T/G polymorphism at position 233 of SEQ ID NO:1, which can lead to a
valine or glycine residue at amino acid position 78 of SEQ ID NO:2, and a C/T polymorphism at position
316 of SEQ ID NO:1, which can lead to an arginine or cysteine residue at amino acid position 106 of SEQ
ID NO:2. As such polymorphisms are the basis for forensic analysis, which does not require any
information at all about the ultimate biological function of the encoded protein and which is undoubtedly a
"real world" utility, the presently claimed sequence must in itself be useful.

Appellants respectfully point out that the presently described polymorphisms are useful in forensic
analysis exactly as they were described in the specification as originally filed - specifically, to distinguish
individual members of the human population from one another based simply on the presence or absence
of one or more of the described polymorphisms. The skilled artisan would be able to use the presently
described polymorphisms in forensic analysis exactly as they were described in the specification as
originally filed, without any additional research. It is important to note that the mere fact that the use of
these polymorphic markers will necessarily provide additional information on the percentage of particular
subpopulations that contain these polymorphic markers does not mean that additional research is needed
in order for these markers, as they are presently described in the instant specification, to be used in forensic
science.
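
The comparison described above, distinguishing samples by the presence or absence of the described polymorphisms, can be sketched as follows. This is a simplified illustration and not a forensic protocol: the sample names and allele calls are hypothetical, and each sample is reduced to a single allele per position, whereas actual genotyping would report two alleles per locus.

```python
from itertools import combinations

# Hypothetical allele calls at the two polymorphic positions of SEQ ID NO:1
samples = {
    "sample_A": {233: "T", 316: "C"},
    "sample_B": {233: "G", 316: "C"},
    "sample_C": {233: "T", 316: "C"},
}


def distinguishable(profile_1: dict, profile_2: dict) -> bool:
    """Two profiles are distinguishable if they differ at any marker position."""
    return any(profile_1[pos] != profile_2[pos] for pos in profile_1)


# Compare every pair of samples using only the marker calls
pairwise = {
    (a, b): distinguishable(samples[a], samples[b])
    for a, b in combinations(samples, 2)
}

for pair, differs in pairwise.items():
    print(pair, "distinguishable" if differs else "not distinguishable")
```

The sketch uses nothing beyond the marker calls themselves: no information about the function of the encoded protein enters the comparison. It also shows that samples sharing the same calls at both positions are not separated by these two markers alone.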

This is also not a case of a potential utility. Even in the worst case scenario, the described
polymorphisms are each useful to distinguish 50% of the population (in other words, the marker being
present in half of the population). Appellants point out that the ability of a polymorphic marker to
distinguish at least 50% of the population is an inherent feature of any polymorphic marker, and this feature
is well understood by those of skill in the art. Appellants note that as a matter of law, it is well settled that
a patent need not disclose what is well known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988).
Appellants respectfully point out that all that is required to support Appellants' assertion of utility is for the
skilled artisan to believe that the presently described polymorphic markers could be useful in forensic
analysis. The fact that forensic biologists use polymorphic markers such as those described by Appellants
every day provides more than ample support for the assertion that forensic biologists would also be able
to use the specific polymorphic markers described by Appellants in the same fashion. Therefore, the
presently claimed sequence clearly has a substantial and well established utility.

The Examiner first questioned this asserted utility because there is no "precise information about
the individual from which a sample under analysis was taken" (the Final Action at page 3). Appellants point
out that this argument has absolutely no bearing on the assertion that the polymorphisms described by
Appellants can be used in forensic analysis. As detailed above, forensic analysis merely determines the
presence or absence of one or more particular polymorphic markers as a means of distinguishing between
individuals. As such, forensic analysis requires absolutely no "information about the individual from which
a sample under analysis was taken". Thus, the Examiner's argument in no way supports the allegation that
the present claims lack a patentable utility.

The Examiner further questioned this asserted utility, stating "(i)t is well known in the art of
molecular biology that the nucleotide sequences encoding an amino acid sequence of any particular protein
will have inconsequential differences from individual to individual, as will the amino acid sequences encoded
thereby. This is why all humans are not all identical and why DNA fingerprinting works" (the Final Action
bridging pages 2 and 3). However, after this admission that the presently described polymorphic markers
have a well-established utility in forensic analysis, the Examiner states that this is not a specific utility
because "almost any cDNA can be employed as a forensic marker in some capacity" (the Final Action at
page 3). Appellants respectfully point out that this argument is flawed in a number of respects. First,
Appellants submit that the asserted forensic utility is specific precisely because it cannot be applied to just
any polynucleotide. In fact, the basis for forensic analysis is the fact that such polymorphic markers are not
present in all other nucleic acids, but are in fact specific and unique to only a certain subset of the population.
This fact is conceded by the Examiner's statement that "almost any cDNA can be employed as a forensic
marker". Second, until a polymorphic marker is actually described it cannot be used in forensic analysis.
Put another way, simply because there is a likelihood, even a significant likelihood, that a particular nucleic
acid sequence will contain a polymorphism and thus be useful in forensic analysis, until such a polymorphism
is actually identified and described, such a likelihood is meaningless. The Examiner appears to be
attempting to use the information presented for the first time by Appellants in the instant specification as
hindsight verification that the presently claimed sequence would be expected to have polymorphic markers.
Such hindsight analysis based on Appellants' discovery is completely improper. Third, the Examiner is
clearly confusing the requirement for a specific utility, which is the proper standard for utility under
35 U.S.C. § 101, with the requirement for a unique utility, which is clearly an improper standard. The fact
that other polymorphic markers have been identified in other genetic loci, or that the use of the presently
described polymorphic markers will provide additional information concerning the prevalence of these
markers in certain subpopulations, does not mean that use of the polymorphic markers identified by
Appellants in SEQ ID NO:1 in forensic analysis is not a specific utility. As clearly stated by the Federal
Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101 (Fed. Cir. 1991; "Carl Zeiss"):

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 
1984) 

In other words, the fact that other (possibly better) polymorphic markers from the human genome have
been described, or that additional information about the presently described polymorphic markers can be
gained through the use of these markers, does not establish that the presently described polymorphic
markers lack a specific utility. Furthermore, the requirement for a unique utility is clearly not the standard
adopted by the Patent and Trademark Office. If every invention were required to have a unique utility, the
Patent and Trademark Office would no longer be issuing patents on batteries, automobile tires, golf balls,
golf clubs, and treatments for a variety of human diseases, such as cancer, just to name a few particular
examples, because the utility of each of these compositions is applicable to the broad class in which each
of these compositions falls: all batteries have the same utility, specifically to provide electrical power; all
automobile tires have the same utility, specifically for use on automobiles; all golf balls and golf clubs have
the same utility, specifically for use in the game of golf; and all cancer treatments have the same utility,
specifically, to treat cancer. However, only the briefest perusal of virtually any issue of the Official Gazette
provides numerous examples of patents being granted on each of the above compositions nearly every
week. Furthermore, if a composition needed to be unique to be patented, the entire class and subclass
system would be an exercise in futility, as the class and subclass system serves solely to group such common
inventions, which would not be required if each invention needed to have a unique utility. In view of the
above standards and "common sense" analysis, there can be little question that the present sequence clearly
meets the requirements of 35 U.S.C. § 101.

Appellants pointed out in the Response to the Final Action that the holding in the Carl Zeiss case
is mandatory legal authority that essentially controls the outcome of the present case. This case, and
particularly the cited quote, directly rebuts the Examiner's argument, which is presumably why the
Examiner failed to address the holding of Carl Zeiss in the Final Action, and continues to avoid addressing
Carl Zeiss in the Advisory Action. Instead of addressing Appellants' arguments, the Examiner merely
rehashes the standard irrelevant arguments concerning general utility - "that any purified compound having
a known structure could be employed as an analytical standard in such processes as nuclear magnetic
resonance (NMR), infrared spectroscopy (IR), and mass spectroscopy as well as in polyacrylamide gel
electrophoresis (PAGE), high performance liquid chromatography (HPLC) and gas chromatography", and
that "any item having a constant mass within an acceptable range can be employed to calibrate a produce
scale in a grocery store" (the Final Action bridging pages 3 and 4). These staid arguments are flawed in
at least two critical respects. First, as pointed out by Appellants above, the admission on the record by
the Examiner that "almost any cDNA can be employed as a forensic marker in some capacity" (the Final
Action at page 3, emphasis added), points to the fact that not all nucleic acids have utility in forensic
analysis. Thus, utility of nucleic acid sequences that contain defined polymorphic markers in forensic
analysis is not a general utility. Second, the reason that such utilities as those listed by the Examiner are not
specific is because these general utilities are applicable to a large number of unrelated compositions. Use
as a calibration standard for a "produce scale" is a utility that is applicable to any composition, no matter
how unrelated, that has mass. In other words, a metal block, an automobile, an elephant, or a nucleic acid
molecule containing a polymorphism could be used to calibrate a produce scale, which is why use as a
calibration standard for a produce scale is not a specific utility. However, a metal block, an automobile,
or an elephant cannot be used in human forensic analysis. In fact, only nucleic acids, and specifically those
human nucleic acids that contain a defined polymorphic marker, can be so used. Thus, these arguments
also fail to support the Examiner's position.

Appellants respectfully point out that these arguments only serve to highlight the Examiner's general 
lack of understanding of forensic analysis. As repeatedly pointed out by Appellants, forensic analysis does 
not require any knowledge about any function of the expressed polynucleotide, or a correlation between 
the presence of any of these polymorphisms and the effect of the presence of any of these polymorphisms 
on the risk of any disease or disorder. Forensic analysis is used to distinguish individual members of the 
human population from one another based simply on the presence or absence of one or more of the 
described polymorphisms. No more and no less is required. No knowledge about the function of the 
encoded protein is required. No nexus between the polymorphic markers and a specific disease or 
disorder is required. The polymorphic markers described by Appellants do not need to be the best 
polymorphic markers, or the only polymorphic markers - they merely need to function as polymorphic 
markers, which is clearly the case. The present polymorphic markers clearly have utility in forensic analysis, 
and, thus, the claims meet the requirements of 35 U.S.C. § 101. 

Furthermore, Appellants pointed out in the Response to the Final Action that, as the presently described
polymorphisms are a part of the family of polymorphisms that have a well-established utility, the Federal
Circuit's holding in In re Brana (34 USPQ2d 1436 (Fed. Cir. 1995); "Brana") is directly on point. In
Brana, the Federal Circuit admonished the Patent and Trademark Office for confusing "the requirements
under the law for obtaining a patent with the requirements for obtaining government approval to market a
particular drug for human consumption". Brana at 1442. The Federal Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office
examination practice and policy. The question is, with regard to pharmaceutical inventions,
what must the applicant provide regarding the practical utility or usefulness of the invention
for which patent protection is sought. This is not a new issue; it is one which we would
have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph.
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C.
§§ 101 and 112, first paragraph. The Federal Circuit concluded:

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of 
pharmaceutical inventions, necessarily includes the expectation of further research and 
development. The stage at which an invention in this field becomes useful is well before
it is ready to be administered to humans. Were we to require Phase II testing in order to 
prove utility, the associated costs would prevent many companies from obtaining patent 
protection on promising new inventions, thereby eliminating an incentive to pursue, through 
research and development, potential cures in many crucial areas such as the treatment of 
cancer. 

Brana at 1442-1443, citations omitted, emphasis added. As set forth above, the present polymorphisms
are useful in forensic analysis as described in the specification as originally filed, without the need for any
further research. As discussed above, even if the use of these polymorphic markers provided additional
information on the percentage of particular subpopulations that contain these polymorphic markers, this
would not mean that "additional research" is needed in order for these markers as they are presently
described in the instant specification to be of use to forensic science. As stated above, using the
polymorphic marker as described in the specification as originally filed can definitely distinguish members
of a population from one another. However, even if, arguendo, further research might be required in
certain aspects of the present invention, this does not preclude a finding that the invention has utility, as set
forth by the Federal Circuit's holding in Brana, which clearly states, as highlighted in the quote above, that
"pharmaceutical inventions, necessarily includes the expectation of further research and development"
(Brana at 1442-1443, emphasis added). In assessing the question of whether undue experimentation
would be required in order to practice the claimed invention, the key term is "undue", not
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (CCPA 1976). The need for some
experimentation does not render the claimed invention unpatentable. Indeed, a considerable amount of
experimentation may be permissible if such experimentation is routinely practiced in the art. In re Angstadt
and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 18 USPQ2d 1016 (Fed. Cir.
1991). Again, as a matter of law, it is well settled that a patent need not disclose what is well known in the
art (In re Wands, supra).

Appellants respectfully point out that the Examiner has provided absolutely no evidence of record 
that would serve to show that an artisan skilled in the art of forensic analysis would doubt Appellants' 
asserted utility. As set forth by Appellants in the Response to the Final Action, it has been clearly 
established that a statement of utility in a specification must be accepted absent reasons why one skilled 
in the art would have reason to doubt the objective truth of such statement. In re Langer, 503 F.2d 1380, 
1391, 183 USPQ 288, 297 (CCPA 1974; "Langer"); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 
367, 370 (CCPA 1971). As set forth in In re Langer (183 USPQ 288 (CCPA 1974); "Langer"): 

As a matter of Patent Office practice, a specification which contains a disclosure of utility 
which corresponds in scope to the subject matter sought to be patented must be taken as 
sufficient to satisfy the utility requirement of § 101 for the entire claimed subject matter 
unless there is a reason for one skilled in the art to question the objective truth of the 
statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence 
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary 
skill in the art" (MPEP, Eighth Edition at 2100-40, emphasis added). Thus, absent such evidence from 
the Examiner concerning the use of the presently described polymorphisms in forensic analysis, the present 
claims clearly meet the requirements of 35 U.S.C. § 101. 

Additionally, in the Response to the Final Action, Appellants pointed out that the specification as 
originally filed indicates that the presently claimed sequence is involved in "chemical communication" 
(specification at page 1, line 28). Appellants further invited the Examiner's attention to the fact that a 
sequence sharing 100% identity at the protein level over the entire length of the claimed sequence 
is present in the leading scientific repository for biological sequence data (GenBank), and has been 
annotated by third party scientists wholly unaffiliated with Appellants as "Homo sapiens similar to 
olfactory receptor MOR40-13" (GenBank accession number XM_291808; alignment shown in 
Exhibit A), and two sequences sharing nearly 100% identity at the protein level over the entire 
length of the claimed sequence are present in the leading scientific repository for biological sequence data 
(GenBank), and have been annotated by third party scientists wholly unaffiliated with Appellants as 
"Homo sapiens similar to olfactory receptor MOR40-13" and "Homo sapiens gene for seven 
transmembrane helix receptor" (GenBank accession numbers XM_062282 and AB065812; alignments 
shown in Exhibit B). Furthermore, the murine olfactory receptor sequence referred to above 
(MOR40-13) shares over 84% identity at the protein level and 91% similarity at the protein level 
with the claimed sequence (GenBank accession numbers NM_146312 and AY073781; alignments shown 
in Exhibit C). The legal test for utility simply involves an assessment of whether those skilled in the art 
would find any of the utilities described for the invention to be credible or believable. Given these GenBank 
annotations, there can be no question that those skilled in the art would clearly believe that Appellants' 
sequence is an olfactory receptor protein, which is clearly involved in chemical communication. Thus, while 
Appellants have provided evidence of record that conclusively establishes that those skilled in the art would 
believe that the specifically claimed sequence encodes an olfactory receptor protein, the Examiner has 
provided no evidence that directly establishes that the specifically claimed sequence does not encode an 
olfactory receptor protein. Accordingly, the evidence of record compels a finding that the present invention 
clearly meets the requirements of 35 U.S.C. § 101. 
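
For illustration only, the percentages cited above can be reproduced from the match counts printed in the alignment headers of Exhibits A through C. The following is a minimal Python sketch, not part of the record. It assumes that the alignment program truncates rather than rounds the percentage, an assumption consistent with 323/324 being reported as 99%. The function names `floor_percent` and `count_identities` are hypothetical and do not belong to any alignment software.

```python
def floor_percent(matches: int, aligned_length: int) -> int:
    """Return the percentage truncated to a whole number (assumed convention)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100 * matches // aligned_length


def count_identities(query: str, subject: str) -> int:
    """Count aligned positions at which both rows carry the same residue."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    # Gap characters ("-") never count as identities.
    return sum(1 for q, s in zip(query, subject) if q == s and q != "-")


# Match counts copied from the alignment headers in Exhibits A through C.
reported = {
    "Exhibit A identities": (324, 324),
    "Exhibit B identities": (323, 324),
    "Exhibit C identities": (264, 312),
    "Exhibit C positives": (286, 312),
}

for label, (matches, length) in reported.items():
    print(f"{label}: {matches}/{length} = {floor_percent(matches, length)}%")
```

As a usage example, applying `count_identities` to the final aligned segment of Exhibit C (`"TKELRAAFQKVL"` against `"TRQLRQGFQKVL"`) counts the identical positions in that 12-residue segment; summing such counts over every segment is how a total such as 264/312 arises.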

Furthermore, Appellants respectfully point out that the present case appears to directly track 
Example 10 of the Revised Interim Utility Guidelines Training Materials (Exhibit D), which only requires 
a similarity score greater than 95% to establish functional homology. Thus, the present utility rejection must 
fail as a matter of policy, as a matter of science, and as a matter of law. 

Appellants need only make one credible assertion of utility to meet the requirements of 
35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); In re Gottlieb, 140 USPQ 665 
(CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); Hoffman v. Klaus, 9 USPQ2d 1657 
(Bd. Pat. App. & Inter. 1988)), and thus the question of the utility of the presently claimed invention should 
be laid to rest. However, as admitted by the Examiner in the First Action, the present application describes 
a novel G-protein coupled receptor. Of the pharmaceutical products currently being marketed by the entire 
industry, 60% of these drugs target G-protein coupled receptors (Gurrath, 2001, Curr. Med. Chem. 
8:1605-1648; Exhibit E). Given that more than half of the currently marketed drugs target proteins that 
are structurally (7TM proteins) and functionally (G-protein interaction) related to the presently described 
sequences, a preponderance of the evidence clearly weighs in favor of Appellants' assertion that the skilled 
artisan would readily recognize that the presently described sequences have a specific (the claimed GPCR 
proteins are encoded by a specific locus on the human genome), credible, and well-established utility, for 
example in tracking gene expression. The specification details on page 9, lines 15-17, that the present 
nucleotide sequences have utility in assessing gene expression patterns using high-throughput DNA chips. 
Such "DNA chips" clearly have utility, as evidenced by hundreds of issued U.S. Patents, as exemplified 
by U.S. Patent Nos. 5,445,934 (Exhibit F), 5,556,752 (Exhibit G), 5,744,305 (Exhibit H), 5,837,832 
(Exhibit I), 6,156,501 (Exhibit J) and 6,261,776 (Exhibit K). Evidence of the "real world" substantial 
utility of the present invention is further provided by the fact that there is an entire industry established based 
on the use of gene sequences or fragments thereof in a gene chip format. Perhaps the most notable gene 
chip company is Affymetrix. However, there are many companies that have, at one time or another, 
concentrated on the use of gene sequences or fragments, in gene chip and non-gene chip formats, for 
example: Gene Logic, ABI-Perkin-Elmer, HySeq and Incyte. In addition, one such company (Rosetta 
Inpharmatics) was viewed to have such "real world" value that it was acquired by a large pharmaceutical 
company (Merck) for significant sums of money (net equity value of the transaction was $620 million). The 
"real world" substantial industrial utility of gene sequences or fragments would, therefore, appear to be 
widespread and well established. Clearly, there can be no doubt that the skilled artisan would know how 
to use the presently claimed sequences (see Section VIII(B), below), strongly arguing that the claimed 
sequences have utility. Given the widespread utility of such "gene chip" methods using public domain gene 
sequence information, there can be little doubt that the use of the presently described novel sequences 
would have great utility in such DNA chip applications. As the present sequences are specific markers of 
the human genome (see below), and such specific markers are targets for the discovery of drugs that are 
associated with human disease, those of skill in the art would instantly recognize that the present nucleotide 
sequences would be ideal, novel candidates for assessing gene expression using such DNA chips. Clearly, 
compositions that enhance the utility of such DNA chips, such as the presently claimed nucleotide 
sequences, must in themselves be useful. Thus, the present claims clearly meet the requirements of 
35 U.S.C. § 101. 

Clearly, persons of skill in the art, as well as venture capitalists and investors, readily recognize the 
utility, both scientific and commercial, of genomic data in general, and specifically human genomic data. 
Billions of dollars have been invested in the human genome project, resulting in useful genomic data (see, 
e.g., Venter et al., 2001, Science 291:1304; Exhibit L). The results have been a stunning success as the 
utility of human genomic data has been widely recognized as a great gift to humanity (see, e.g., Jasny and 
Kennedy, 2001, Science 291:1153; Exhibit M). Clearly, the usefulness of human genomic data, such as 
the presently claimed nucleic acid molecules, is substantial and credible (worthy of billions of dollars and 
the creation of numerous companies focused on such information) and well-established (the utility of human 
genomic information has been clearly understood for many years). 

As yet a further example of the utility of the presently claimed polynucleotide, Appellants noted in 
the Response to the First Action that the present nucleotide sequence has a specific utility in "mapping a 
unique gene to a particular chromosome", as described in the specification at least at page 3, lines 36-37. 
This is evidenced by the fact that SEQ ID NO: 1 can be used to map SEQ ID NO: 1 to chromosome 11 
(present within two independent chromosome 11 clones; GenBank Accession Numbers AC116156 and 
AC109341; alignments and the first page from the GenBank reports are presented in Exhibit N). Clearly, 
the present polynucleotide provides exquisite specificity in localizing the specific region of human 
chromosome 11 that contains the gene encoding the given polynucleotide, a utility not shared by virtually 
any other nucleic acid sequences. In fact, it is this specificity that makes this particular sequence so useful. 
Early gene mapping techniques relied on methods such as Giemsa staining to identify regions of 
chromosomes. However, such techniques produced genetic maps with a resolution of only 5 to 10 
megabases, far too low to be of much help in identifying specific genes involved in disease. The skilled 
artisan readily appreciates the significant benefit afforded by markers that map a specific locus of the human 
genome, such as the present nucleic acid sequence. 

Appellants respectfully reminded the Examiner that only a minor percentage (2-4%) of the genome 
actually encodes exons, which in turn encode amino acid sequences. The presently claimed polynucleotide 
sequence provides biologically validated empirical data (e.g., showing which sequences are transcribed and 
polyadenylated) that specifically define that portion of the corresponding genomic locus that actually 
encodes exon sequence, as described above. Appellants respectfully submit that the practical scientific 
value of biologically validated, expressed and polyadenylated mRNA sequences is readily apparent to those 
skilled in the relevant biological and biochemical arts. For further evidence in support of the Appellants' 
position, the Board is requested to review, for example, section 3 of Venter et al. (supra at 
pp. 1317-1321, including Fig. 11 at pp. 1324-1325; see Exhibit L), which demonstrates the significance 
of expressed sequence information in the structural analysis of genomic data. The presently claimed 
polynucleotide sequence defines a biologically validated sequence that provides a unique and specific 
resource for mapping the genome essentially as described in the Venter et al. article. Thus, the present 
claims clearly meet the requirements of 35 U.S.C. § 101. 

The Examiner's main argument concerning these asserted utilities is that, once again, other nucleic 
acid sequences can be used in a similar fashion - "almost any cDNA can be ... used as a chromosomal or 
tissue marker or in a gene chip for expression profiling" (the Final Action at page 3). Appellants once again 
point out that these arguments are completely rebuffed by the Federal Circuit's holding in Carl Zeiss, supra 
("[A]n invention need not be the best or only way to accomplish a certain result"). 

Regarding the utility requirements under 35 U.S.C. § 101, the Federal Circuit has clearly stated 
"(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of providing 
some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 1700 
(Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal Circuit 
has stated that "(t)o violate § 101 the claimed device must be totally incapable of achieving a useful result." 
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed. Cir. 
1992), emphasis added. Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); "Cross") 
states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748, 
emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under the sun 
that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group Inc., 
149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in 
Diamond v. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S., 1980)). Thus, based on the relevant 
case law, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Finally, while Appellants are well aware of the new Utility Guidelines set forth by the USPTO, 
Appellants respectfully point out that the current rules and regulations regarding the examination of patent 
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set 
forth in 37 C.F.R., not the Manual of Patent Examining Procedure or particular guidelines for patent 
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to 
interpret these laws and rules. Appellants are unaware of any significant recent changes in either 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit 
that are in keeping with the new Utility Guidelines set forth by the USPTO. This is underscored by numerous 
patents that have been issued over the years that claim nucleic acid fragments that do not comply with the 
new Utility Guidelines. As examples of such issued U.S. Patents, the Board is invited to review U.S. Patent 
Nos. 5,817,479 (Exhibit O), 5,654,173 (Exhibit P), and 5,552,281 (Exhibit Q; each of which claims 
short polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit R; which includes no 
working examples), none of which contain examples of the "real-world" utilities that the Examiner seems 
to be requiring. Additionally, the Office has recently issued U.S. Patent 6,043,052 (Exhibit S), which 
concerns an "orphan" G-Protein coupled receptor identified based only on homology to the orphan 
receptor GPR25, similar to the situation with Appellants' currently claimed sequence. Importantly, this 
issued patent also contains no examples of the "real world" utilities seemingly required in the present case. 
As issued U.S. Patents are presumed to meet all of the requirements for patentability, including 
35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below), Appellants submit that the 
present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Appellants understand 
that each application is examined on its own merits, Appellants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit, 
since the issuance of these patents that render the subject matter claimed in these patents, which is similar 
to the subject matter in question in the present application, as suddenly non-statutory or failing to meet the 
requirements of 35 U.S.C. § 101. Thus, holding Appellants to a different standard of utility would be 
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1-5 under 
35 U.S.C. § 101 must be overruled. 

B. Are Claims 1-5 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-5 under 35 U.S.C. § 112, first paragraph, since allegedly 
one skilled in the art would not know how to use the invention, as the invention allegedly is not supported 
by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have 
determined that the utility requirement of Section 101 and the how to use requirement of Section 112, first 
paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; In re 
Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 439 F.2d 
1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-5 have been shown 
to have "a specific, substantial, and credible utility", as detailed in Section VIII(A) above, the present 
rejection of claims 1-5 under 35 U.S.C. § 112, first paragraph, cannot stand. 

Appellants therefore submit that the rejection of claims 1-5 under 35 U.S.C. § 112, first paragraph, 
must be overruled. 

IX. APPENDIX 

The claims involved in this appeal are as follows: 

1. (Previously Presented) An isolated nucleic acid molecule comprising a nucleotide sequence that 
encodes the amino acid sequence of SEQ ID NO:2. 

2. (Previously Presented) An isolated nucleic acid expression vector comprising a nucleotide 
sequence encoding the amino acid sequence of SEQ ID NO: 2, said vector having the property of being 
capable of expressing the amino acid sequence of SEQ ID NO: 2 when present in a suitable host cell. 

3. (Previously Presented) The isolated nucleic acid molecule of claim 1, wherein said nucleotide 
sequence comprises the sequence of SEQ ID NO:1. 

4. (Previously Presented) The isolated nucleic acid expression vector of claim 2, wherein said 
nucleotide sequence comprises the sequence of SEQ ID NO:1. 

5. (Previously Presented) A host cell comprising the expression vector of claim 2. 



X. CONCLUSION 

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's conclusion 
that claims 1-5 lack a patentable utility and are unusable by the skilled artisan due to a lack of patentable 
utility is unwarranted. It is therefore requested that the Board overturn the Final Action's rejections. 



Respectfully submitted, 

>XM_291808 ACCESSION: XM_291808 NID: gi 29743652 ref XM_291808.1 

Homo sapiens similar to olfactory receptor MOR40-13 [Mus 
musculus] (LOC340982), mRNA 
Length = 975 

Score = 644 bits (1643), Expect = 0.0 

Identities = 324/324 (100%) , Positives = 324/324 (100%) 
Frame = +1 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 180 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 
Sbjct: 181 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 360 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 361 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 540 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 541 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 720 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 721 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 900 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 901 ALQTKELRAAFQKVLFALTKEIRS 972 



>XM_062282 ACCESSION: XM_062282 NID: gi 29746563 ref XM_062282.7 

Homo sapiens similar to olfactory receptor MOR40-13 [Mus 
musculus] (LOC120806), mRNA 
Length = 975 

Score = 641 bits (1635), Expect = 0.0 

Identities = 323/324 (99%), Positives = 323/324 (99%) 

Frame = +1 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 180 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPE FAQIYAIHFFVGME 
Sbjct: 181 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPECFAQIYAIHFFVGME 360 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 361 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 540 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 541 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 720 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 721 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 900 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 901 ALQTKELRAAFQKVLFALTKEIRS 972 



>AB065812 ACCESSION: AB065812 NID: gi 21928889 dbj AB065812.1 Homo 
sapiens gene for seven transmembrane helix receptor, 
complete cds, isolate : CBRC7TM_375 
Length = 1366 

Score = 641 bits (1635), Expect = 0.0 

Identities = 323/324 (99%), Positives = 323/324 (99%) 

Frame = +3 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 192 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 371 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPE FAQIYAIHFFVGME 
Sbjct: 372 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPECFAQIYAIHFFVGME 551 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 552 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 731 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 732 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 911 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 912 EAAAKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 1091 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 1092 ALQTKELRAAFQKVLFALTKEIRS 1163 



>NM_146312 ACCESSION: NM_146312 NID: gi 22129666 ref NM_146312.1 Mus 
musculus olfactory receptor MOR40-13 (MOR40-13), mRNA 
Length = 960 

Score = 532 bits (1355), Expect = e-149 

Identities = 264/312 (84%), Positives = 286/312 (91%) 

Frame = +1 

Query: 4 MSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQNPS 63 

MSASLK  NSSK QVSEFILLGFPGIHSWQHWLSLP  LLYLSA+  N LILIII Q+PS 
Sbjct: 1 MSASLKAFNSSKSQVSEFILLGFPGIHSWQHWLSLPFTLLYLSAIGTNVLILIIICQDPS 180 

Query: 64 LQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGMESGI 123 

L+QPMY+FLGIL +VDMGLATTI+PKILAIFWFDAKVISLPE FAQIYAIH FVGMESGI 
Sbjct: 181 LKQPMYLFLGILSWDMGLATTIMPKILAIFWFDAKVISLPECFAQIYAIHCFVGMESGI 360 

Query: 124 LLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSKNEI 183 

 LCMAFDRYVAIC+PLRY SI+T+SLILKATLFMVLRNGL V PVPVLAAQR+YCS+NEI 
Sbjct: 361 FLCMAFDRYVAICYPLRYSSIITNSLILKATLFMVLRNGLCVIPVPVLAAQRNYCSRNEI 540 

Query: 184 EHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSAEAA 243 

+HCLCSNLGVTSLACDDRRPNSICQL+LAW+GMGSDL LIILSY LIL SVLRLNSAEA 
Sbjct: 541 DHCLCSNLGVTSLACDDRRPNSICQLILAWVGMGSDLGLIILSYTLILRSVLRLNSAEAV 720 

Query: 244 AKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVYALQ 303 

+KAL+TCSSHL LILFFYT+VWISVTHL+E KATLIPVLLNV+HNI PPSLNP VYAL+ 
Sbjct: 721 SKALNTCSSHLILILFFYTVVWISVTHLSETKATLIPVLLNVMHNITPPSLNPIVYALR 900 

Query: 304 TKELRAAFQKVL 315 

T++LR  FQKVL 
Sbjct: 901 TRQLRQGFQKVL 936 



>AY073781 ACCESSION: AY073781 NID: gi 18480859 gb AY073781.1 Mus 
musculus olfactory receptor MOR40-13 gene, complete cds 
Length = 960 

Score = 532 bits (1355), Expect = e-149 

Identities = 264/312 (84%), Positives = 286/312 (91%) 

Frame = +1 

Query: 4 MSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQNPS 63 

MSASLK  NSSK QVSEFILLGFPGIHSWQHWLSLP  LLYLSA+  N LILIII Q+PS 
Sbjct: 1 MSASLKAFNSSKSQVSEFILLGFPGIHSWQHWLSLPFTLLYLSAIGTNVLILIIICQDPS 180 

Query: 64 LQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGMESGI 123 

L+QPMY+FLGIL +VDMGLATTI+PKILAIFWFDAKVISLPE FAQIYAIH FVGMESGI 
Sbjct: 181 LKQPMYLFLGILSWDMGLATTIMPKILAIFWFDAKVISLPECFAQIYAIHCFVGMESGI 360 

Query: 124 LLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSKNEI 183 

 LCMAFDRYVAIC+PLRY SI+T+SLILKATLFMVLRNGL V PVPVLAAQR+YCS+NEI 
Sbjct: 361 FLCMAFDRYVAICYPLRYSSIITNSLILKATLFMVLRNGLCVIPVPVLAAQRNYCSRNEI 540 

Query: 184 EHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSAEAA 243 

+HCLCSNLGVTSLACDDRRPNSICQL+LAW+GMGSDL LIILSY LIL SVLRLNSAEA 
Sbjct: 541 DHCLCSNLGVTSLACDDRRPNSICQLILAWVGMGSDLGLIILSYTLILRSVLRLNSAEAV 720 

Query: 244 AKALSTCSSHLTLILFFYTIVWISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVYALQ 303 

+KAL+TCSSHL LILFFYT+VWISVTHL+E KATLIPVLLNV+HNI PPSLNP VYAL+ 
Sbjct: 721 SKALNTCSSHLILILFFYTVVWISVTHLSETKATLIPVLLNVMHNITPPSLNPIVYALR 900 

Query: 304 TKELRAAFQKVL 315 

T++LR  FQKVL 
Sbjct: 901 TRQLRQGFQKVL 936 



characterize the protein. A starting material that can only be used to produce 
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were 
sequenced and open reading frames were identified. The specification 
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
Ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the 
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes. 
Note that if there is a well-established utility already associated with the 
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a well- 
established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 
101 rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should 
not be made. 

Example 11: Animals with Uncharacterized Human Genes 

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence nor the presence of a specific protein has been
identified with the disease condition.

Abstract- Over the last decades distinct members of the G Protein-Coupled Receptor
(GPCR) family emerged as prominent drug targets within pharmaceutical research, since
approximately 60 % of marketed prescription drugs act by selectively addressing
representatives of that class of transmembrane signal transduction systems. It is
noteworthy that the majority of GPCR-targeted drugs elicit their biological activity by
selective agonism or antagonism of biogenic monoamine receptors, while the development
status of peptide-binding GPCR-addressing compounds is still in its infancy.
Exemplified on selected medicinal chemistry projects, this review will focus on the
opportunities of therapeutic intervention into a broad spectrum of disease processes
through agonizing or antagonizing the [illegible] peptide-binding GPCRs. In this
context, a brief overview of [illegible] pathways will be given in order to emphasize
the biomedical relevance of a controlled modulation of receptor [illegible]. Modern
trends on lead finding and optimization strategies for peptide-binding GPCR-targeted
low-molecular weight compounds will be highlighted on the basis of current research
programs conducted in the area of angiotensin II, endothelin, bradykinin, neurokinin,
neuropeptide Y, LHRH, C5a antagonists and somatostatin agonists, respectively. Special
emphasis will be laid on the elaboration and utilization of [illegible] potential drug
candidates, thus facilitating more detailed insights into the underlying molecular
recognition event.



INTRODUCTION 

Current pharmaceutical research is going through a period 
of unprecedented change, since new revolutionizing 
techniques have been successfully implemented into the 
pharmaceutical discovery process. At the same time, 
pharmaceutical industry feels growing pressure to release 
more new chemical entities (NCEs) that evolve as highly 
selective drugs targeting therapeutic areas of unmet medical 
need and address novel mechanisms of action. These 
attributes clearly define an ideal set of preconditions for 
positioning a candidate with blockbuster potential onto the
drug market [1-3]. The conceptual combination of automated
combinatorial chemistry and multiple parallel synthesis with
high-throughput screening has dramatically altered the
process of lead finding in medicinal chemistry in that vast
numbers of low molecular weight compounds can rapidly be
screened against biological target systems [4]. This progress
in medicinal chemistry is paralleled on the side of target
identification and validation with the maturation of
genomics, proteomics, and bioinformatics in pharmaceutical
research [5]. Taken together, these novel methodologies are
expected to facilitate and accelerate the overall drug discovery 
process significantly. 

However, the judicious choice of a disease relevant target
is still one of the most crucial steps in initiating a drug
discovery project, both in terms of novelty and uniqueness of
the underlying therapeutic principle, as well as the
competitor situation [2].

In this context, the superfamily of transmembrane G 
protein-coupled receptors (GPCRs) emerged as the most 
prominent class of qualified drug targets for pharmaceutical 
research and biomedical application [6]. Approximately 60% 
of all commercially available drugs work by selective 
modulation of distinct members of this target family [7]. 
Even though an estimated number of 1000 to 2000 GPCRs 
is expected to exist in the human genome [8], current 
GPCR-targeted therapeutic principles exploit a surprisingly 
small fraction of the GPCR family known today. A strong 
bias exists among the GPCR-targeted drugs in favour of the 
subclass of biogenic monoamine-stimulated GPCRs, i.e. the 
classical neurotransmitter-binding receptors [9,10]. 

This review will focus on the opportunity to further 
expand the spectrum of drug-targeted GPCRs onto the huge 
subclass of peptide-binding representatives of that target 
family. After a brief introduction on the basic principles of 
receptor structure and function, the chemically diverse set of 
endogenous ligands will be discussed with the aim to 
emphasize the relevance of peptide-binding GPCRs for 
modern drug discovery.

The lead identification and optimization attempts
discussed in this contribution are restricted to projects that
aim to identify peptidomimetic or non-peptide agonists
or antagonists. Numerous pharmaceutical research efforts
conducted over the last two decades have clearly proven the
relevance of early pharmacokinetic profiling.
Consequently, satisfactory metabolic stability and oral
bioavailability demand a transfer of the peptide-encoded
biological and structural information onto non-peptide,
drug-like scaffolds in order to achieve the desired goal [11-13].

Classical attempts towards drugs selectively addressing 
peptide-binding GPCRs will be exemplified on the 
angiotensin II and endothelin receptor antagonists. In both 
areas, leads were identified by screening programs and further 
optimized by classical medicinal chemistry approaches to 
yield clinical candidates, some of which already entered the 
market. The classical approach of optimizing screening hits 
will further be introduced with medicinal chemistry 
programs aimed to identify active compounds for a 
modulation of the bradykinin, neurokinin, and NPY 
(neuropeptide Y) receptors. Since the area of peptide-binding 
GPCR compounds is still in its infancy, especially when 
compared to the situation of biogenic amine-binding receptor 
drugs, the actual state of the majority of projects discussed in 
this review is still in the preclinical or in early clinical 
phases. Apart from random lead finding attempts, structural 
rationales are more frequently used in recent times, 
precedented by studies on somatostatin, bradykinin, 
neurokinin, LHRH (luteinizing hormone-releasing hormone), 
and anaphylatoxin C5a receptor agonists and antagonists that 
will be discussed briefly. Structural rationales were mainly 
derived from an educated guess on the bioactive 
conformation of the endogenous peptide or protein ligand, 
thus offering the opportunity to follow an indirect drug 
design approach. 




GPCR SUPERFAMILY

G protein-coupled receptors constitute the largest receptor 
family known today [8]. According to an analysis of the C.
elegans genome [14], approximately 5% of the 19100
nematode genes encode GPCRs with a family distribution
profile that is reminiscent of that of mammalian GPCR
genes. Extrapolation of these findings would suggest that up
to 5000 distinct GPCR-encoding genes exist within the
human genome (5% of an estimated 100000 genes).
Currently, more than 800 distinct members of the GPCR
superfamily have been cloned from various species, ranging
from fungi over plants, yeast, slime mould, protozoa,
metazoa to humans. Apart from the sensory olfactory
receptors, approximately 150 human GPCRs have been
cloned for which also the endogenous ligands have been
identified. Further, more than 100 GPCRs are known with
unidentified ligands and unknown physiological relevance,
so called orphan GPCRs, which undoubtedly represent a rich
source of disease-relevant drug targets for future biomedical
research [15-17]. 
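
The extrapolation quoted above is plain proportional arithmetic. A minimal sketch in Python, assuming that the 5% GPCR fraction found for C. elegans carries over unchanged to another genome (the function name and its default value are illustrative and do not come from the cited analysis):

```python
def estimate_gpcr_genes(total_genes, gpcr_fraction=0.05):
    """Estimate the number of GPCR-encoding genes in a genome.

    Assumption: the fraction of GPCR genes observed in one genome
    (5% in the C. elegans analysis) applies to the genome of interest.
    """
    return round(total_genes * gpcr_fraction)


if __name__ == "__main__":
    # Gene counts as quoted in the text.
    print("C. elegans:", estimate_gpcr_genes(19100))
    print("human (assuming 100000 genes):", estimate_gpcr_genes(100000))
```

The result scales linearly with the assumed gene count, so a lower estimate of the total number of human genes lowers the GPCR estimate in proportion.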



Structure and Function of GPCRs 

GPCRs belong to the class of integral plasma membrane
proteins and share a common receptor protein topology
throughout the entire family. The structure paradigm is a
seven helix bundle that spans the cell membrane in an
almost perpendicular orientation, thereby establishing a
functional link between the exterior and the cytoplasm of the
cell [6,18-20]. The seven transmembrane sequence stretches
can be identified by hydrophobicity analyses since they
exhibit an increased hydrophobic signature in a
corresponding hydrophobicity profile. From numerous
biophysical and biochemical studies it is now generally
accepted that GPCRs intercalate into the cell membrane with
their N-terminus in the extracellular compartment, while the
C-terminus reaches into the cytoplasm of the cell. The seven
transmembrane helices (7TM domain) that constitute the
central core domain of all GPCRs, are sequentially connected
by extracellular and intracellular loops. Apart from variations
in the primary structure, GPCRs differ in length of these
loops, as well as in length and function of both N- and
C-termini. The ACTH (adrenocorticotropic hormone) receptor
is one of the smallest GPCRs known with 297 residues.
Biogenic monoamine receptor sequences cover a size from
approximately 350 to 600 residues, peptide receptor
sequences are found between 400 and 750 residues, while the
mGluRs (metabotropic glutamate receptors) mark the upper
boundary consisting of roughly 1200 amino acid residues
[21].

Fig. (1). Side-by-side stereo presentation of the Cα trace model of rhodopsin derived from various biophysical and bioinformatics
studies. The helix bundle is shown in a sideview, the extracellular compartment being on the top. For details see references [22-31].
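
The hydrophobicity analysis mentioned above can be illustrated with a sliding-window profile. The following Python sketch is an assumption-laden example, not the procedure used in the cited studies: it uses the Kyte-Doolittle hydropathy scale, a window of 19 residues and a cut-off of 1.6, which are common textbook choices that the review itself does not specify. The function names and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter code).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy for every window of `window` residues."""
    scores = [KYTE_DOOLITTLE[res] for res in sequence.upper()]
    if len(scores) < window:
        return []
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_segments(profile, threshold=1.6):
    """Runs of window start positions whose average reaches the threshold.

    Returns (first_window_start, last_window_start) pairs, 0-based.
    """
    segments, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments


if __name__ == "__main__":
    # Hypothetical toy sequence: charged flanks around one hydrophobic stretch.
    toy = "DEKRDEKRDEKR" + "LIV" * 7 + "DEKRDEKRDEKR"
    profile = hydropathy_profile(toy)
    print(candidate_segments(profile))
```

For a real GPCR sequence one would expect seven such candidate stretches, although loop regions and the chosen cut-off can merge or split peaks, so the profile is a first indication and not a structural assignment.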

Even though no high-resolution structure of any
pharmaceutically relevant member of the GPCR superfamily
has been determined by e.g. x-ray crystallography, low
resolution models derived from electron cryo-microscopy and
electron diffraction of bovine, frog and squid rhodopsin reveal
a detailed picture of the insertion mode of each helix within
the context of the transmembrane helix bundle domain
(Fig. (1)) [22-31].

From a functional point of view, GPCRs share a
common property in that they work as transmembrane
transducer systems by transferring an extracellular message
across the cell membrane, thus allowing the affected tissue to
respond to a broad range of signalling molecules [32-35].
Upon extracellular binding of the molecular stimulus, the
central core domain (7TM domain) is believed to undergo a
conformational change, thereby transmitting the extracellular
binding event into the cytoplasm (Fig. (2)). The binding of
a receptor agonist leads to an intracellular interaction of the 
receptor protein with its cognate heterotrimeric GDP-bound 
G protein. The agonist-promoted conformational change of 
the receptor protein followed by the cytoplasmic G protein- 
coupling initiates the activation of intracellular effector 
systems by the G protein cycle (Fig. (2)). The coupling 
event catalyzes the exchange of GDP for GTP and the
dissociation of the GTP-bound α subunit from the βγ
heterodimer. Depending on the very nature of the G protein
α subtype, different effector systems such as enzymes (e.g.
adenylyl cyclase, phospholipase C) or ion channels are
functionally modulated, which substantially amplifies the
production of second messengers. The effector activation
event is accompanied by a GTPase activity of the α subunit
releasing inorganic phosphate. In its GDP-bound form, the
α subunit exhibits high affinity for the βγ
heterodimer, finally forming the GDP-bound heterotrimeric
G protein again. The modulated concentration of second
messengers elicits phosphorylation cascades across the
cytoplasm to the nucleus, eventually activating the final
physiological response of a cell to the original extracellular
stimulus. Even though this functional paradigm accounts for
all known GPCRs, this obvious convergence after the ligand
binding event is diversified by the selective activation of
only distinct types of G proteins from which e.g. numerous
different Gα subunits are known (Fig. (2)) [32-35].




Fig. (2). Schematic representation of the ligand-GPCR interaction mediated G protein cycle.
Effector systems shown in the figure: adenylyl cyclase, Ca-channels, Na-channels,
phospholipase C, cGMP phosphodiesterase.

In order to fully characterize the mechanism of action of
GPCRs, a thermodynamic "eight-state-model" has been
developed as a mechanistic hypothesis describing the
macroscopic properties of transitions among distinct
conformational states (Fig. (3)) [36]. The simplest way to
describe the ligand-induced receptor activation event is a
ternary complex model (A) that defines two distinct affinity
states of the receptor for agonist binding, notably the free
receptor (Rec) and the G protein-bound form (G·Rec)
(Fig. (3)A). Agonists would display high affinity to the G
protein-associated receptor, while antagonists would exhibit only
low affinity towards the complex. With the discovery that
GPCRs can activate G proteins in the absence of any
agonist, the simple ternary complex model required an
extension. To account for the existence of such constitutively
active GPCRs, a receptor activation step in the unliganded
form was introduced (Fig. (3)B). This receptor isomerization
hypothesis resulted in a "six-state-model" in which the
activated receptor (Rec*) is capable of signalling in both the
G protein-associated form (G·Rec*), and in the ternary
complex (G·Rec·Lig). The problem with that receptor
activation-extended ternary complex model is that the G
protein only binds to the receptor in its activated form Rec*.
Experimental evidence clearly suggests that G proteins do
also bind to the resting state (Rec) without subsequent G
protein activation. To account for these findings and to refer
to the microscopic reversibility principle of thermodynamics,
an "eight-state-model" was proposed in which the receptor
protein can undergo three distinct processes, namely (i)
ligand binding, (ii) receptor isomerization, and (iii) G
protein binding (Fig. (3)C). Agonists can bind to four
different receptor states clearly favouring the activated states
generated either by receptor isomerization or by G protein
association. Inverse agonists would prefer to bind the
non-activated groundstate (Rec), while partial agonists show
affinity to both receptor states but still cause receptor
activation. In the thermodynamic "eight-state-model" an
antagonist would just block the interconversion of different
states rather than preferably bind to distinct states (Fig. (3))
[36].
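
The state counts of the three models follow from treating ligand binding, receptor isomerization and G protein binding as three independent yes/no processes. The Python sketch below enumerates the resulting states and checks the microscopic reversibility condition, namely that the product of equilibrium constants around any closed cycle equals one. It is an illustration only: the free energy values are hypothetical numbers chosen for the example, and the function names are not taken from reference [36].

```python
import itertools
import math

RT = 2.479  # kJ/mol, roughly 298 K

# Each state is (ligand bound, receptor isomerized, G protein bound).
STATES = list(itertools.product((0, 1), repeat=3))


def state_name(state):
    lig, act, g = state
    name = "Rec*" if act else "Rec"
    if g:
        name = "G·" + name
    if lig:
        name += "·Lig"
    return name


def transitions(states):
    """Pairs of states that differ in exactly one of the three processes."""
    return [(a, b) for a, b in itertools.combinations(states, 2)
            if sum(x != y for x, y in zip(a, b)) == 1]


def free_energy(state, d_lig=-20.0, d_act=8.0, d_g=-5.0, coupling=-6.0):
    """Hypothetical free energy (kJ/mol): additive terms plus pair coupling."""
    lig, act, g = state
    pairs = lig * act + act * g + lig * g
    return lig * d_lig + act * d_act + g * d_g + coupling * pairs


def equilibrium_constant(a, b):
    """K for the step a -> b, derived from the state free energies."""
    return math.exp(-(free_energy(b) - free_energy(a)) / RT)


def cycle_product(cycle):
    """Product of equilibrium constants around a closed cycle of states."""
    product = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        product *= equilibrium_constant(a, b)
    return product


# Restricted models: no isomerization step (four states), or G protein
# binding only to the isomerized receptor (six states).
FOUR_STATE = [s for s in STATES if s[1] == 0]
SIX_STATE = [s for s in STATES if not (s[2] and not s[1])]

if __name__ == "__main__":
    print([state_name(s) for s in STATES])
    print(len(FOUR_STATE), len(SIX_STATE), len(STATES))
    print(cycle_product([(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]))
```

Because every equilibrium constant is derived from state free energies, the cycle product is one by construction; fitting independent constants to each step without this constraint would violate the reversibility principle the eight-state model is meant to respect.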

In order to address phenomena such as isosteric or
allosteric antagonism, structural models with atomic 
resolution are mandatory that are actually frequently used for 
both rationalizing structure-activity relationships of low 
molecular weight agonists and antagonists, as well as 
understanding the results from site-directed mutagenesis 
experiments. A detailed discussion of the actual status of 
experimentally derived, and molecular modeling derived 
GPCR structures [37] is beyond the scope of this review, 
since this contribution is mainly aimed to introduce the 
currently applied technologies to identify compounds 
selectively modulating peptide-binding GPCRs. 



GPCR Classification 

Exhaustive sequence analysis revealed three major 
homology families for the mammalian GPCRs, notably the 
family 1 or rho-family (prototype: rhodopsin), the family 2 
or scr-family (prototype: secretin receptor), and the family 3 
or mGluR family (prototype: metabotropic glutamate 
receptors) receptors (Fig. (4)) [32-35]. Family 1 receptors are 
divided into further subfamilies according to the size and
chemical nature of their corresponding agonists, as well as
the mode of ligand binding. Family 1a accommodates the
β-adrenoceptor-type receptors that are activated by small
ligands such as biogenic monoamines, opiates, nucleotides,
and small peptides, that comparably bind to a
transmembrane cavity formed by helices 3, 4, 5, and 6.
Family 1b is composed of receptors stimulated by
oligopeptides and proteins such as IL-8 (interleukin-8),
cytokines, and thrombin. The ligand binding epitope is
located in the extracellular loop region. Family 1c receptors
recognize glycoprotein hormones such as LH (luteinizing
hormone), TSH (thyroid-stimulating hormone), and FSH
(follicle-stimulating hormone) while their ligand binding site
is centred in a large extracellular N-terminal domain
(Fig. (4)).

Fig. (3). [Caption partly illegible] receptor activation. A: "four-state" model; B: "six-state" model; C: "eight-state" model;
[illegible] signalling (for details see text). States labelled in the figure include Rec, Rec·Lig, Rec*, Rec*·Lig, G·Rec and G·Rec*·Lig.

Fig. (4). Sequence homology-derived classification of GPCRs. Each GPCR family is characterized by a common ligand binding mode.
Ligands shown for family 1: (a) retinal, odorants, biogenic amines, adenosine, opiates, enkephalins; (b) peptides, cytokines, IL-8,
formyl peptides, thrombin; (c) glycoprotein hormones (LH, TSH, FSH, ...). Family 2: calcitonin, VIP, α-latrotoxin, PACAP, secretin,
GnRH, PTH, CRF. Family 3: mGluRs, Ca2+, GABA, pheromones.

Family 2 receptors are distinct from rho-family receptors
in that they bind large peptides like glucagon, secretin, PTH
(parathyroid hormone), VIP (vasointestinal peptide), or CRF
(corticotropin-releasing factor). Comparable to family 1c
receptors, the secretin family utilizes a large N-terminal
domain for ligand binding. Family 3 receptors are unique
since they possess a large extracellular N-terminal domain of
several hundred residues that constitutes the binding site for
smallish ligands such as a single divalent Ca2+ cation,
glutamate, GABA (γ-amino butyric acid), and pheromones
(Fig. (4)).



In the light of this classification, peptide-binding
receptors are not structurally homogeneous since they belong
to families 1 and 2. Consequently, correlation of sequence
homology with ligand similarity remains questionable,
which is also reflected by the mutually different binding modes
of peptidic and non-peptidic agonists and antagonists.



Ligand Variety 

GPCRs are stimulated by an amazingly large number of 
agonists covering a broad range of chemical diversity. 
Ligands are as small as divalent cations, biogenic 
monoamines such as acetylcholine or serotonin, fragrances 
and taste molecules such as aspartame or limonene, single
amino acids such as glutamate or GABA, or nucleotide
analogues such as adenosine. Medium-sized ligands range
from cannabinoids over prostaglandins to small
oligopeptides such as enkephalins, angiotensin II,
bradykinin, somatostatin, and tachykinins. Larger
oligopeptides and globular proteins constitute the family of
macromolecular ligands including e.g. neuropeptide Y, C5a
anaphylatoxin, interleukin-8, or chemokines. Even
proteolytic enzymes such as thrombin, which activates its
receptor by cleaving off an N-terminal peptide, selectively
bind to distinct members of the GPCR superfamily. Apart
from their important role in sensory perception including
[...] [illegible] classes, i.e. nucleosides, lipid mediators,
neurotransmitters, peptides, and proteins [6,18,38].

In this context, it is interesting to note that the majority
of GPCR-targeted therapeutic principles exploit only a single
compound class, notably the neurotransmitters. When the
number of currently identified neurotransmitter receptors is
compared with the number of disease-relevant peptide-binding
GPCRs, an obvious imbalance becomes apparent in
that only a small number of peptide-binding GPCRs is
targeted by established therapies. Agonism and antagonism
of e.g. α and β adrenoceptors, dopamine, histamine,
serotonin, or muscarinic acetylcholine receptors are well
established therapeutic principles for numerous [illegible]
drugs covering virtually all therapeutic areas, including
gastrointestinal, cardiovascular, and CNS indications. In
contrast, only two peptide-binding GPCR families are
addressed by marketed non-peptide drugs, namely the opioid
receptors and the angiotensin II receptor. However, the
importance of peptide- and protein-binding GPCRs for drug
discovery continues to be manifested by the fact that across
current pharmaceutical research, especially in industry,
numerous projects are pursued to identify leads that, upon
optimization, fulfil all pharmacodynamic and
pharmacokinetic demands required for clinical applicability
(Table 1).

CLASSICAL LEAD FINDING AND DRUG 
DEVELOPMENT 

Currently applied drug design and discovery approaches 
are typically classified as rational or random, depending on 
whether or not structural rationales are employed. The area of 
GPCR agonists and antagonists research is mainly driven by 
screening approaches in which large numbers of randomly 
selected chemical entities are tested in high-throughput 
screens. These shotgun procedures provide a practical means 
for identifying new leads for a particular receptor. In the 
following, this classical approach for GPCR-targeted drug 
discovery will be exemplified with prototype studies 
conducted on the angiotensin II, endothelin, bradykinin, 
neurokinin, and NPY receptors, respectively. 



Table 1. Selection of Endogenous Peptides that Exert their Biological Activity by Selective Activation of a GPCR

Peptide-Binding G Protein-Coupled Receptors

GPCR | code | native ligand (peptide/protein) | nature of the ligand
vasopressin receptors | V1A, V1B, V2 | vasopressin | Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2
oxytocin receptor | OT | oxytocin | Cys-Tyr-Ile-Gln-Asn-Cys-Pro-Leu-Gly-NH2
vasotocin receptor | VT | vasotocin | Cys-Tyr-Ile-Gln-Asn-Cys-Pro-Arg-Gly-NH2
[illegible] | OX1, OX2 | orexin A/B | 33 aa/28 aa peptide amides
FSH receptor | FSH receptor | follicle-stimulating hormone (FSH) | protein
LSH receptor | LSH receptor | lutropin, choriogonadotropic hormone, luteinizing hormone | protein
TSH receptor | TSH receptor | thyrotropin, thyroid-stimulating hormone | [illegible]
LHRH receptor | LHRH receptor | gonadotropin-releasing hormone (GnRH), luteinizing hormone-releasing hormone (LHRH) | pGlu-His-Trp-Ser-Tyr-Gly-Leu-Arg-Pro-Gly-NH2
thyrotropin-releasing hormone & secretagogue receptors | TRH1, TRH2 | thyrotropin-releasing hormone/factor (TRH/F) | pGlu-His-Pro-NH2
GHS receptor | GHSR1a, GHSR1b | growth hormone secretagogues (GHS) | oligopeptides
calcitonin/calcitonin gene-related peptide receptors | CGRPR | calcitonin, calcitonin gene-related peptide (CGRP) | 32 aa peptide amide
amylin receptor | amylin receptor | amylin | 37 aa peptide amide
adrenomedullin receptor | adrenomedullin receptor | adrenomedullin | 52 aa peptide amide
corticotropin-releasing factor receptor | CRF1, CRF2 | corticotropin-releasing factor | 41 aa peptide amide
gastric inhibitory peptide receptor | GIP receptor | gastric inhibitory peptide (GIP) | 42 aa peptide
glucagon/glucagon-like peptide receptor | GLP1 | glucagon | 29 aa peptide
growth hormone-releasing hormone receptor | GHRH receptor | growth hormone-releasing hormone/factor (GHRH/GRF) | [illegible]
parathyroid hormone receptor | type 1, type 2 | parathyroid hormone (PTH) | 84 aa peptide
secretin receptor | secretin receptor | secretin | 27 aa peptide amide
vasoactive intestinal peptide & PACAP receptor | VPAC1, VPAC2, PAC1 | vasoactive intestinal peptide (VIP); pituitary adenylate cyclase activating peptide (PACAP) | 28 aa peptide amide; 38 aa peptide


Angiotensin-II Antagonists

Biomedical Significance 

The endogenous octapeptide hormone angiotensin-II (A-II)
(Table 1), Asp-Arg-Val-Tyr-Ile-His-Pro-Phe, is the key
effector compound of the renin-angiotensin system (RAS)
which is one of the main blood pressure and electrolyte/fluid
homeostasis regulating systems in mammals [39]. As a result
of a proteolytic cascade starting with angiotensinogen,
angiotensin-II is released from its precursor decapeptide
angiotensin-I by the action of angiotensin-I converting
enzyme (ACE), the latter being a qualified target of
antihypertensive drugs [40]. The conversion from
angiotensinogen to angiotensin-I is catalyzed by the aspartic
protease renin, peptide-type inhibitors of which have not yet
reached an advanced state of clinical development [41]. A-II
interacts specifically with two different receptor subtypes of
the GPCR superfamily, notably the AT1 and the AT2
receptor, respectively [21]. Interaction with the AT1 receptor
causes severe vasoconstriction, aldosterone release,
vasopressin secretion, and renal sodium reabsorption. These
effects convergently result in a dramatic increase of
extracellular fluid volume, thus giving rise to a significant
hypertensive effect. Therapeutic intervention into the RAS
clearly offers major clinical and commercial success as shown
with the ACE inhibitors for the treatment of hypertension
and congestive heart failure [40]. Due to the fact that ACE
inhibitors cause dry cough and angioedema [42], new
strategies have been sought to block the vasoconstrictory
activities of the biologically active player, A-II [43]. Specific
inhibition of the A-II target receptor interaction, the final step
of the RAS, offers an entirely new and selective approach to
blocking this regulatory system regardless of the source of
the biologically active peptide. And indeed, selective
nonpeptide A-II antagonists emerged as a new class of
antihypertensives on the cardiovascular drug market
exemplified by the released drugs Losartan 1 [44,45],
Valsartan 2 [46], Eprosartan 3 [47], Irbesartan 4 [48],
Candesartan 5 [49], and Telmisartan 6 [50], respectively
(Fig. (5)).

Fig. (5). Structures of marketed A-II antagonists.

Consequently, the angiotensin receptor represents one of 
the most advanced drug targets from the family of peptide- 
binding (non-opioid) GPCRs in the sense that screening hits 
have successfully been transferred to leads, further to 
development candidates that finally reached the drug market 
as safe and innovative drugs introducing a new therapeutic
principle. 

Lead Finding 

In the search for A-II antagonists, potent peptides have
been synthesized in a classical ligand-based design concept,
yielding e.g. [Sar1,Ala8]-Angiotensin-II, commonly termed
Saralasin [51]. However, all these peptides display limited
therapeutic value as potential antihypertensives due to their
poor oral bioavailability, rapid excretion, structural
complexity, and significant agonistic profiles [51,52].




The feasibility of identifying nonpeptide AT receptor 
binding compounds with purely antagonistic profile was 
demonstrated by a research group at Takeda Chemical 
Industries in 1982. In a series of two patents, Furukawa and 
co-workers reported on the inhibition of angiotensin-II- 
induced contractile response in rabbit aorta by numerous 
different 1-benzylimidazole-5-acetic acid derivatives (Fig. (6))
[53]. The two compounds S-8307 7 and S-8308 8 mark the
beginning of a new era of antihypertensive drug research in
which almost every pharmaceutical company attempted to
derive new compounds from these initial findings.

Drug Development 

The Takeda compounds served as lead structures for the
development of highly potent and selective analogues at
DuPont that culminated in Losartan 1 (DuP-753, EXP-7711),
the first nonpeptide A-II antagonist that got approval
by the FDA and reached the market (Fig. (7)). Guided by
molecular modeling studies, the substitution pattern of the
benzylic phenyl-ring was changed yielding EXP-6155, 9
which displayed a ten-fold increased binding affinity over
e.g. S-8307 7 [54]. Further extension in para-position of the
aromatic ring resulted in more potent analogues as shown
with EXP-6803 10 [55].

Fig. (6). Initial lead structures disclosed by Takeda Chemical Industries.

Fig. (7). Development of Losartan 1.

The deletion of the interaromatic carboxamide linkage
yielding biphenylmethyl-substituted imidazole-5-acetic acid
derivatives produced orally active compounds and
subsequent exchange of the ortho-carboxylic acid on the
terminal aromatic ring against the tetrazole moiety further
improved the oral activity [56,57]. The imidazole-5-acetic
acid substituent was modified to the corresponding alcohol
in the analogue chosen as clinical candidate. However, later
it could be shown that the parent acetic acid sidechain of the
imidazole core is the active metabolite of Losartan 1 [58].

Instead of modifying the N-1 substituent of the Takeda
imidazole derivatives, 7 and 8, SmithKline Beecham
decided to explore the 5 position in more detail (Fig. (8)).
Introduction of an acrylic acid in that position (11) resulted
in a 15-fold enhancement in binding affinity. Further
introduction of a 2-thienylmethyl group in α-position of the
acrylic acid substituent (12) together with a modification of
the N-1 benzylic substituent finally yielded SK&F-108566,
3 [59,60] which inhibits A-II binding to its receptor in the
single digit nanomolar range [61].

Fig. (8). Development of Eprosartan 3.

Fig. (9). Next-generation "sartans" in advanced states of clinical development.

The Ciba compound CGP-48933, 2 (Fig. (5)) is the
result of an optimization process attempting to replace the
imidazole ring structure originally described by Takeda [53].
The 1-benzyl-2-butyl-4-chloro-imidazole-5-acetic acid I is
replaced with an N-terminally acylated amino acid, notably
valine. CGP-48933, 2 has passed the clinical development
and reached the market as Valsartan [62]. It is clearly beyond
the scope of this review to systematically summarize the lead
optimization programs pursued by the [illegible]
pharmaceutical companies; however, it should be
emphasized that, apart from the currently marketed drugs,
numerous next-generation compounds and follow-ups in late
clinical development are expected to get approved in the near
future (Fig. (9)) [63,64]. These new "sartans" (13 - 20)
together with the first generation drugs (1 - 6) will further
change the landscape of antihypertensive prescription drugs
since they clearly introduced a new quality of
antihypertensive principles into therapy of cardiovascular
diseases.

Apart from these biomedical aspects, the development of 
the "sartans" acting specifically on a member of the GPCR 
superfamily evolved to a textbook example of protein- 
targeted drug design within modern medicinal chemistry 
[65]. 



Endothelin 

Biomedical Significance 

Endothelin 1 (ET-1) is a 21 amino acid bicyclic peptide
(Table 1) that was initially isolated from porcine aortic
endothelial cells [66]. The endothelins constitute a class of
three related isopeptides (ET-1, ET-2, ET-3) [67],
exhibiting vasoconstrictive and mitogenic potential [68]
upon binding to two receptor subtypes, notably the ETA and
ETB receptors [69,70]. ET-1 selectively binds to the ETA
receptor which is expressed on vascular smooth muscle cells
(lung, aortic, heart) and mediates vasoconstriction and
proliferation through activation of a complex intracellular
signalling cascade [71,72]. The ETB receptor, localized in
the brain, on vascular endothelial cells, and smooth muscle
cells, is responsible for vasodilation via the release of nitric
oxide, prostacyclin, and adrenomedullin [73,74]. In
addition, ETB functions as a clearance receptor for
endogenous ET by the internalization of the receptor-ligand
complex. On the other hand, ETB may also cause
vasoconstriction in some tissues [75]. ETA and ETB
receptors share high sequence similarity (approx. 68%). ET-1 is
predominantly produced by endothelial cells acting in an
autocrine and paracrine fashion as a mediator of vascular
function. Elevated ET levels have been observed in tissue and
plasma in a number of cardiovascular disorders, thereby
contributing to disease states including hypertension [76],
vasospasm, atherosclerosis [77], acute myocardial infarction
[78], congestive heart failure [79,80], restenosis [81],
subarachnoid hemorrhage, ischemia, pulmonary hypertension
[82], and renal failure [83]. Due to the pivotal
pathophysiological role of the endothelin receptor-ligand
interaction, this receptor system emerged as a promising
target for therapeutic intervention in the disease states
mentioned above [84].

Fig. (10). Aryl-sulfonamide-type ET antagonists.



Lead Finding 

Since the discovery of ET-1 in 1988, a large number of
potent antagonists have been described [84]. The first
antagonists emerging from random screening efforts were
reported in 1992. These first generation compounds
comprise anthraquinones from Streptomyces misakiensis,
steroids isolated from bayberry, Myrica cerifera, and
diphenyl ethers discovered in fungal broths [85]. Lead
finding in this field is mainly based on compound library
screening followed by classical lead optimization within
medicinal chemistry programs. A number of peptide-based
antagonists have been reported including the prominent
cyclic pentapeptide BQ-123, and other peptide antagonists,
e.g. BQ-788, FR-139317, PD145065, PD156252, RES-701-1,
TAK-044, and IRL2500 [84-88].

As mentioned above, this review will focus on the
development of nonpeptide antagonists emerging from those
programs directed towards the discovery of active low
molecular weight compounds. Primarily, the ETA-selective
antagonists as well as antagonists with mixed ETA/ETB
affinity play a major role in therapeutic intervention, even
though some ETB-selective antagonists have been reported
only recently.

Aryl Sulfonamides

Bristol Myers Squibb designed BMS182874, 21, a
nonpeptide ETA-selective antagonist from an initial hit
which was discovered by screening of a sulfathiazole library
[89]. The sulfonamide BMS182874, 21, exhibits an IC50
value of 150 nM at the ETA receptor (A10 cells) and shows
no binding affinity to the ETB receptor (Fig. (10)).

From a similar series of compounds,
Immunopharmaceuticals (Texas Biotech.) developed an
isoxazolyl-thiophene sulfonamide, TBC-11251, 22
(Sitaxsentan) [90]. This orally active compound has shown
efficacy in a phase II clinical trial of congestive heart failure
(CHF) and demonstrated activity in a rat model of
myocardial infarction and acute hypoxia-induced pulmonary
hypertension (PH) [91]. Further investigations established a
unique pharmacophore framework, characterized by a central
thiophene subunit for selective ETA antagonism [92].
Maintaining the sulfonamide substituent in position 3 and
altering the substituent in position 2 in the thiophene ring
led to a series of compounds with enhanced
properties. TBC-2576, 23, the optimal analogue in this
series, showed about 10-fold higher ETA binding affinity
compared to Sitaxsentan, 22, and high ETA-selectivity, as
well as a serum half-life of 7.3 h in rats, paired with in vivo
activity (Fig. (10)) [92].

Fig. (11). Butenolide-type ET antagonists.

A number of nonpeptide ETA/ETB antagonists based on
a pyrimidyl-benzene sulfonamide scaffold have been reported.
The first example of an orally active representative is Ro46-
2005 24 [93], which was obtained after optimization of a
lead compound identified by random screening in an
antidiabetic project. The binding affinities of Ro46-2005, 24
(Ki=220 nM (ETA), Ki=1000 nM (ETB)) could be further
optimized, yielding the bipyrimidyl-benzene analogue Ro47-
0203 25 (Bosentan), which represents an improvement in
both receptor binding affinities (Ki=4.7 nM (ETA), Ki=95
nM (ETB)) and oral activity (Fig. (10)) [94]. Bosentan, 25 is
a competitive mixed ETA/ETB antagonist and shows
promising results in clinical trials [88], including
vasodilation. Further, it improves ventricular
performance and reduces renal dysfunction. The beneficial
effects of Bosentan 25 have been characterized in CHF
models, in hypertension-related experiments, and in
subarachnoid hemorrhage (SAH) trials. These and other
potential applications have been described in a recent review
by Roux et al. [88].

Butenolides

CI-1020, also known as PD156707, 26, 27 [95] emerged
from the optimization of an initial lead structure which was
identified from library screening (Fig. (11)). The
optimization was guided by following a
"decision tree" approach based on QSAR principles [96]. CI-
1020 26, 27 represents the first clinical candidate emerging
from the Parke-Davis series. With an IC50
value of 0.30 nM on the recombinant ETA receptor
(IC50=780 nM (ETB)) it demonstrates high ETA-selectivity.
CI-1020, 26, 27 exists in two interconverting forms,
establishing the γ-hydroxy butenolide 27
under acidic conditions, while at basic pH the equilibrium is
shifted in favour of the ring-opened γ-keto acid salt structure
26 [95]. The poor water-solubility of this compound, caused
by cyclization, has driven the drug development
towards a series of water-soluble ring-closed γ-hydroxy
butenolides applicable for parenteral use [97]. One of the
follow-up compounds exhibits a promising pharmacological
profile by displaying improved activity compared to CI-
1020, 26, 27, e.g. in preventing acute hypoxia-induced
pulmonary hypertension (PH) in rats.

The most promising characteristics were found for an
analogue containing the sodium salt of a sulfonic acid,
compound 28 (Fig. (11)) [97]. It shows high ETA-
selectivity (4200-fold) with an IC50 value of 0.38 nM (ETA)
and ETA functional activity of KB=7.8, which is similar or
even superior to the progenitor CI-1020 26, 27. Moreover, it
displays improved water-solubility and shows higher
activity after i.v. infusion in preventing acute hypoxia-
induced PH in rats (ED50=0.3 µg/kg/h) when compared to
CI-1020 26, 27 [97]. The new compounds are currently
being evaluated in preclinical trials, while CI-1020 26, 27 has
already been tested in a model of acute stroke and has entered
clinical development for cerebral ischemia.

Indane Carboxylic Acids

SB209670 29 emerged from the SmithKline Beecham
laboratories after optimization of an initial hit discovered
from compound library screening (Fig. (12)) [98]. Within a
molecular modeling-driven approach based on a comparison
of the NMR-derived conformation of ET-1 with the primary
hit, an indene carboxylic acid derivative, the mixed
ETA/ETB receptor antagonist SB209670 29 was designed
(Ki=0.43 nM (ETA), Ki=14.7 nM (ETB)). When
administered i.v., SB209670 29 shows efficacy in different
animal models of ET-mediated disease states, e.g. renal
failure, hypertension [84], and ischemia-induced stroke. Due
to the low oral bioavailability (4%), a structurally related
analogue, SB217242 30 [99], was investigated that displays
improved pharmacokinetics and bioavailability [86].
SB209670 29 is under development (phase 1) for acute i.v.
indications with efficacy in pulmonary hypertension (PH),
chronic renal failure (CRF) and stroke [87], while SB217242
30 (phase 1) is in development for chronic PH and chronic
obstructive pulmonary disease (COPD) [87,100].

Pyrrolidine Carboxylic Acids

The SmithKline Beecham compound SB209670 29
(Fig. (12)) served as template for the design of the
pyrrolidine carboxylic acid A-127722 rac-31 (Fig. (13))
[101], which has been disclosed as a potent, ETA-selective
antagonist, currently tested in clinical trials (PH, CHF) [87].
A-127722 rac-31 was reported to dose-dependently prevent
cerebral oedema in stroke-prone spontaneously hypertensive
rats [100]. ABT-627 31, the active enantiomer (2R,3R,4S) of
the trans,trans-configured 2,3,4-trisubstituted pyrrolidine
ring, shows an IC50 value of 0.08 nM on ETA and 8.1 nM
on ETB [102]. The 1800-fold selectivity was dramatically
altered by subtle structural modifications of A-127722 rac-
31, which led to A-182026 32 with an ETA/ETB selectivity
ratio of 3, thus being the most potent balanced dual
ETA/ETB antagonist known today. Replacement of the
dialkyl-acetamide (rac-31) by a 2,6-dialkyl-acetanilide
resulted in an ETB-selective antagonist, A-192621 33,
exhibiting promising pharmacological properties [103].
Combination of the structure-activity relationships (SAR)
derived from the first series of ETA-selective compounds
(e.g. ABT-627 31) and the second series of ETB-selective
antagonists (e.g. A-192621 33) led to a further optimized
series of compounds. Therein A-308165 34 has been
identified as a highly selective (27000-fold), orally active ETB
antagonist [104].

Fig. (12). Indane carboxylic acid-type ET antagonists.




Fig. (13). Pyrrolidine carboxylic acid-type ET antagonists.

Administration of ETB-selective antagonists led to
hypertensive responses, indicating that they are not suitable
as agents for a long-term systemic single ETB-directed
therapy [103]. Nevertheless, ETB-selective antagonists are
expected to be a valuable tool for the elucidation of the role
of the ETB receptor action under normal and
pathophysiological conditions [104]. Most recently an
ETA-selective antagonist, derived by optimization of A-
127722 rac-31, emerged from the series of pyrrolidine-based
compounds [105]. A-216546 35 is a further orally active ET
receptor antagonist showing >25000-fold selectivity for the
ETA receptor (Ki=0.46 nM), and is considered for clinical
development as a therapeutic agent for chronic treatment of
ET-1-mediated diseases [106]. Compound 36 (IC50=5.6 nM
(ETA), >10000-fold selectivity) is currently under
investigation at Abbott Laboratories as ETA antagonist.
Apart from the ET receptor affinity, A-216546 35 showed
remarkable inhibition potential for numerous members of the
GPCR superfamily such as adenosine receptors, δ-opioid
receptor, purinergic receptor, etc. [106], thus indicating a
kind of "ligand crosstalk" which turns out to be a common
phenomenon of GPCR-targeted compounds.

Phenylacetamides

L-749,329 37 (Fig. (14)) is an orally active, competitive
and nonselective ETA/ETB antagonist developed by Merck,
inhibiting the binding of [125I]ET-1 in Chinese Hamster
Ovary (CHO) cells expressing human ET receptors with
IC50 values of 0.8 nM (ETA) and 16 nM (ETB), respectively
[107]. The active enantiomer, L-754,142 37, is a potent
orally active ET antagonist with a long duration of action in
several in vivo models. L-754,142 37 shows binding affinity
towards ETA (0.062 nM) and ETB (2.25 nM) and
antagonizes ET-1-induced phosphatidyl inositol hydrolysis
in CHO cells expressing cloned human ET receptors with
IC50 values of 0.35 nM (ETA) and 26 nM (ETB) [108].
Substitution of the ether oxygen by a methylene group
resulted in L-751,281 38, an analogue with similar activities
on both ET receptor subtypes [107].
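
The subtype preference implied by the paired values above can be made explicit as a simple ratio of the ETB value to the ETA value. The following is a minimal sketch, not part of the original review: the function name `selectivity_fold` and the dictionary labels are illustrative assumptions, and the numbers are the ones quoted in the preceding paragraph.

```python
def selectivity_fold(affinity_eta_nm: float, affinity_etb_nm: float) -> float:
    """Return the ETA-over-ETB preference as the ratio ETB value / ETA value.

    Both inputs are concentrations in nM (IC50 or Ki); a lower value means
    tighter binding, so a ratio above 1 indicates a preference for ETA.
    """
    if affinity_eta_nm <= 0 or affinity_etb_nm <= 0:
        raise ValueError("binding constants must be positive")
    return affinity_etb_nm / affinity_eta_nm


# Values (nM) as quoted in the text; the labels are illustrative only.
quoted = {
    "L-749,329 (IC50, binding)": (0.8, 16.0),
    "L-754,142 (binding affinity)": (0.062, 2.25),
    "L-754,142 (IC50, PI hydrolysis)": (0.35, 26.0),
}

ratios = {name: selectivity_fold(eta, etb) for name, (eta, etb) in quoted.items()}

for name, fold in ratios.items():
    print(f"{name}: {fold:.1f}-fold preference for ETA")
```

Ratios of this size (roughly 20- to 75-fold) are small next to the 1800-fold and 27000-fold figures cited for the pyrrolidine series, which is consistent with the text calling L-749,329 nonselective. Ratios taken from different assays or cell systems are not directly comparable.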

α-Phenoxyphenylacetic Acids

At the Merck laboratories, structural modifications of an
initial lead discovered by screening for angiotensin II (AII)
antagonists led to a dual AT1/ET antagonist. Further
optimization towards ETA-selectivity resulted in L-744,453
39 (Fig.), an α-phenoxyphenylacetic acid derivative
lacking the sulfonamide present in the arylacyl sulfonamides
L-749,329 37 and L-751,281 38 [107]. L-744,453 39
competitively and reversibly inhibits [125I]ET-1 binding to
CHO cells expressing cloned human ET receptors with Ki
values of 4.3 nM (ETA) and 232 nM (ETB). With
L-744,453 39, the shift from an original angiotensin II
antagonist to an ET-selective antagonist could be
demonstrated, highlighting the potential of "cross-
fertilization" between projects devoted to representatives of a
common receptor superfamily.

Fig. (14). Phenylacetamide-type ET antagonists.

Fig. (15). Aryloxyacetic acid-type ET antagonists.

α-Aryloxyacetic Acids

At the BASF laboratories, too, the endothelin project
started with screening of the in-house chemical substance
stock. The initial lead, which was originally intended as a
herbicide, was optimized by systematic structural
modifications resulting in an ETA-selective antagonist,
LU135252 40 (Fig. (15)), the active (S)-configured
enantiomer of LU127043 [109,110]. It selectively binds to
the ETA receptor with high affinity (Ki=2 nM (ETA),
Ki=184 nM (ETB)) [111]. LU135252 40 has been evaluated
in clinical trials for preventing restenosis [87] and entered
phase II for CHF [112]. Furthermore, it was demonstrated
that selective ETA receptor inhibition with LU135252 40
could reduce ischemia-induced ventricular arrhythmias in
pigs. Thus ET antagonism might reduce mortality by
preventing arrhythmias, a major cause of death in CHF,
evidently induced by the pro-arrhythmogenic effects of ET-1
[100].



Peptide-Binding G Protein-Coupled Receptors 
Phenoxybutanoic Acids and Stilbene Acids

According to a previously elaborated SAR study, Astles
et al. at Rhone-Poulenc Rorer presented the optimized
analogue RPR-111844 41 (Fig. (16)), which exhibits an
IC50 of 5.0 nM at the rat ETA receptor and 1000-fold
selectivity over the ETB receptor. The promising
pharmacokinetics in a rat model of ET-1-induced
vasoconstriction rendered RPR-111844 41 an ideal
candidate to examine these effects in preclinical models of
cardiovascular disease [113].

In order to shed light on the characteristics of the
bioactive conformation, a new series of rigidified analogues,
the stilbene acids, was designed based on the SAR derived
from the series of phenoxybutanoic acids. Thus, compound
RPR-111723 42 was identified as the most potent analogue
with an IC50 of 80 nM. Although the stilbene series was not
further developed, results from SAR will be back-transferred
into the more interesting series of phenoxybutanoic acids
[114].

Fig. (16). Phenoxybutanoic acid- and stilbene acid-type ET
antagonists.

Bradykinin

Biomedical Significance

The nonapeptide bradykinin (BK, Table 1), Arg-Pro-Pro-
Gly-Phe-Ser-Pro-Phe-Arg, belongs to the family of kinins.
Kinins are small peptides which are released from
kininogens by several enzymes, the kallikreins [115-120].
Interaction of BK with two designated receptor subtypes, B1
and B2, results in a variety of biological effects including
vasodilation, modulation of vascular permeability, smooth
muscle contraction, recruitment and priming of inflammatory
cells, induction of pain, modulation of transmitter release,
stimulation of cell division, etc. [121]. Based on these
diverse biological activities, BK is involved in inflammatory
diseases, such as asthma, rhinitis, pancreatitis, sepsis,
rheumatoid arthritis, brain oedema, and angioneurotic
oedema [122]. Due to these pathophysiological actions of
BK, mainly induced by the interaction with the B2 receptor,
this system emerged as an interesting target in
pharmaceutical research. Hence, in a number of efforts BK
antagonists were presented that are intended to be valuable
tools in the treatment of the above-mentioned chronic diseases.

Lead Finding

'Second-Generation' B2 Antagonists

Initiated by the discovery of NPC-567 by Vavrek and
Stewart [123] in the 90's, a number of selective peptidic B2
receptor antagonists including Icatibant (Hoe-140) [124,125]
and Bradycor (Deltibant, CP-0127) [126], so-called 'second-
generation' antagonists, have been clinically evaluated. In
the following years, research programs were directed towards
the discovery of B2-selective nonpeptide antagonists.
Detailed overviews on this subject were provided only
recently by Altamura et al. [127] and Heitsch [128],
addressing projects of diverse research groups and reviewing
the current patent situation.

In 1993, the naphthylalanine derivative WIN-64338 43
(Fig. (17)) was disclosed as the first nonpeptide B2
antagonist [129,130]. A random screening approach at
Sterling Winthrop led, after optimization, to compound WIN-
64338 43, displaying a Ki value of 64 nM for the inhibition
of [3H]BK binding to the B2 receptor (IMR-90 cells, a fetal
lung fibroblast cell line expressing the kinin B2 receptor).
However, this compound is problematic in terms of potency,
oral bioavailability, and selectivity [130], since significant
affinity for e.g. the muscarinic receptors was detected [131].

Fig. (17). 'Second-generation' B2 antagonist WIN-64338.

'Third-Generation' B2 Antagonists

From 1994 on, Fujisawa published a series of patent
applications on new classes of potent, selective and orally
active nonpeptide B2 receptor antagonists [132-135], thereby
establishing the so-called 'third-generation' compounds.
Several derivatives showed nanomolar affinity in receptor
binding assays and high efficacy in various species including
humans. They also exhibited in vivo functional antagonistic
activity against BK-induced bronchoconstriction in guinea
pigs and potency in diverse animal models of inflammation
[132-137]. Again, these compounds originally
emerged from a random screening directed towards the
angiotensin II (AII) AT1 receptor and belong to a class of
imidazo[1,2-a]pyridines. A detailed description of the
design, synthesis and biological evaluation was given by
Kayakiri et al. only recently [138]. The first lead compound
44 (Fig. (18)) of this series of N-containing heteroaromatic
benzyl ethers showed an IC50 value of 7.6 µM.

Within a classical medicinal chemistry approach based
on SAR considerations, the first lead compound 44 was
exposed to extensive modifications leading to 45 (Fig. (18)).
This analogue displays an IC50 value of 2.4 nM for the
inhibition of the specific binding of [3H]BK to B2 receptors
in guinea pig ileum (GPI) membrane preparations. Thus, the
8-[3-(N-acylglycyl-N-methylamino)-2,6-dichlorobenzyloxy]-
3-halo-2-methylimidazo[1,2-a]pyridine skeleton was
identified as the basic framework of the first orally active
nonpeptide B2 antagonist. In order to overcome species
differences, further modifications within the 3-position of the
benzyl moiety revealed an analogue (FR167344 46)
exhibiting subnanomolar (IC50=0.66 nM) and low
nanomolar binding affinities (IC50=1.4 nM) for GPI
membrane and human A431 cells (epidermoid carcinoma
cells) [136,139], respectively.

Recent results indicate that FR167344 46 has specific
antagonistic activity against guinea pig tracheal smooth
muscle BK receptors, thus rendering it a potential
therapeutic tool for the treatment of asthma [140].
Derivatives containing the N,N-dimethylcarbamoyl-
substituted cinnamide group were capable of overcoming
species differences, and therefore defined the required
pharmacophore for further investigations. FR167344 46 was
assigned as new lead compound for three independent
optimization approaches implying substitutions within the
imidazo[1,2-a]pyridine moiety (benzimidazoles,
quinoxalines, and quinolines). While further optimization of
the quinoxaline series failed, optimization within the
benzimidazole and quinoline series resulted in several potent
congeners. Thus, systematic SAR studies of the
benzimidazoles afforded improvements of in vivo oral
activities, resulting in FR185627 47, which exhibits 75.2 %
inhibition against BK-induced bronchoconstriction at 0.32
mg/kg, i.p. [138]. Optimization of the quinoline series
afforded compound FR173657 48 with high potency in B2
binding affinities for both GPI (IC50=0.46 nM) and human
recombinant B2 receptors (IC50=1.4 nM) [136,141].
FR173657 48 displays the best in vivo B2 antagonistic oral
activity among nonpeptide antagonists investigated so far
and was chosen as a clinical candidate for the treatment of
various inflammatory diseases. Recent investigations on
plasma extravasation mediated by activation of sensory
nerves in guinea pig airways suggest FR173657 48 to be an
orally active, promising anti-inflammatory agent for kinin-
dependent inflammation following antigen challenge [142].
Fujisawa researchers further report on the postulation of the
active conformation of their compounds by synthesizing
conformationally restrained analogues. Molecular modelling
studies and subsequent chemical synthesis of a novel pyrrole
series afforded FR193144 49, an analogue which mimics the
previously postulated cis-conformation of the N-methylamide
by the pyrrole moiety. FR193144 49 exhibits excellent
binding affinity for human recombinant B2 receptors
(IC50=0.26 nM), thereby supporting the cis-conformation as the
bioactive conformation of the N-methylamide bearing
antagonists (Fig. (18)) [138].

Fig. (18). 'Third-generation' B2 antagonists developed by Fujisawa.

Interestingly, only minor variations within the core
structure of the B2 antagonists resulted in an analogue,
FR190997 50 (Fig. (18)), exhibiting an agonistic profile
[143]. The agonistic behaviour is hypothesized to be
encoded in the difference concerning the 4-substituent of the
quinoline moiety within the agonist compared to the
antagonists (H vs. 2-pyridylmethoxy). FR190997 50 induces
a hypotensive response in anaesthetized rats and thus is
claimed for the treatment of hypertension, renal failure, heart
failure, circulatory disorders, angina, restenosis, hepatitis etc.
[143].

B2 Antagonists Structurally Related to FR173657

Compounds evaluated at Fournier are structurally related
to Fujisawa's quinoline series, differing mainly in the
substituent in 3-position of the benzene linkage, which is
replaced by a sulfonamide. LF16-0335 51 (Fig. (19)) is a
potent, selective and competitive antagonist of the human B2
receptor, displacing [3H]BK binding to membrane
preparations of CHO cells expressing cloned human B2
receptors with a Ki value of 0.84 nM.

LF16-0335 51 shows no affinity for the B1 receptor,
nor does it bind significantly to any other receptor
except the muscarinic M2 (IC50=0.9 µM) and M1 (IC50=1.0
µM) receptors [144]. The hydrochloride of this derivative,
LF16-0335C, competitively inhibits BK-induced
contractions of isolated rat uterus and GPI in functional
assays [145]. In vivo, LF16-0335C inhibits BK-induced
hypotension in both animal species in a dose-dependent
manner [145]. Substitution of the piperazine ring in LF16-
0335 51 by a diaminopropane unit led to LF16-0687 52
(Fig. (19)), which was shown in competition binding studies
with [3H]BK to bind to the human recombinant B2 receptor
expressed on CHO cells with a Ki value of 0.67 nM (LF16-
0335 51: Ki=0.84 nM). It functions as a competitive
antagonist of BK-mediated contractions in isolated organs,
i.e. rat uterus and GPI. Contrary to LF16-0335 51, LF16-
0687 52 showed selectivity for the B2 receptor in binding
and functional studies performed on more than 40 different
receptors.

In a new series of patent applications, Hoechst claimed a
number of derivatives based on the lead structures delineated
by Fujisawa as potent B2 receptor antagonists. These
heteroarylbenzyl ethers belong to a series of O-substituted 8-
quinolines or 4-benzothiazoles [146]. Heitsch et al. report
that the potency of the quinoline series was found to be
higher compared to the corresponding benzothiazoles. The
most potent antagonist 53 (Fig. (20)) shows an IC50 value of
0.7 nM for the inhibition of specific binding of [3H]BK to
GPI membrane preparations and an EC50 value of 4.1 nM for
the inhibition of BK-induced contraction of isolated GPI.

The most potent corresponding antagonist of the
benzothiazole series, 54 (Fig. (20)), exhibits an IC50 value of
10.3 nM and an EC50 value of 54 nM. Another
representative example of the B2 antagonists claimed by
Hoechst is compound 55 (Fig. (20)), which incorporates a 2-
aminoethanol unit instead of the N-methylamide as linker in
the central part of the molecule. Compound 55 inhibits
[3H]BK binding (GPI) with a Ki value of 20 nM [127,128].

Based on the template FR173657 48, Kyowa Hakko filed
a patent application claiming heteroarylbenzyl ethers as B2
antagonists [147]. As in FR173657 48, the central ether
entity is flanked by a terminal quinoline and a
dichlorobenzene linker. Instead of the classical N-
methylamide sidechain in 3-position, the dichlorobenzene
linker bears a branched hydrocarbon chain (56, Fig. (20)).

Fig. (19). B2 receptor antagonists disclosed by Fournier.

Fig. (20). Miscellaneous heteroarylbenzyl ether-type B2 antagonists.

Miscellaneous Nonpeptide B2 Antagonists

From screening of a 4000-compound combinatorial
library, GlaxoWellcome found a promising
tetrahydroisoquinoline, GR213548X 57 (Fig. (21)), with
affinity for the B2 receptor in the micromolar range [127].

Further B2 antagonists are claimed in a series of patent
applications by a number of companies. American Home
Products (AHP, Wyeth Ayerst) presented compound 58,
which structurally resembles the Fujisawa derivatives only
with respect to a quinoline entity. Pfizer described 1,4-
dihydropyridines such as 59 to act as B2 antagonists, while
Eli Lilly disclosed benzothiophenes 60 (Fig. (21)) [127,148-
150].

Fig. (21). Miscellaneous B2 antagonists.

Neurokinin 

Biomedical Significance 

Neurokinins (NKs), also termed tachykinins, belong to a
family of peptides sharing a common homologous C-
terminal fragment composed of the pentapeptide amide Phe-
Xaa-Gly-Leu-Met-NH2 (Table 1) [151]. The interaction of
Substance P (SP), neurokinin A (NKA), and neurokinin B
(NKB) with their corresponding receptors [152], notably
NK1, NK2, and NK3, plays a pivotal role in induction and
progression of inflammatory diseases. Neurokinin interaction
is involved in a variety of physiological and pathophysiolo-
gical conditions such as pain, inflammation, smooth muscle
contraction, vasodilation, and activation of the immune
system. Thus, NK receptor antagonists emerged as
interesting agents primarily for the treatment of pain, emesis
and asthma, but also for intervention in other disorders such as
anxiety, arthritis, migraine, cancer and schizophrenia [153-
156]. NK receptor antagonists have been reviewed e.g. by
Elliot and Seward [157], von Sprecher et al. [158], and,
only recently, in Current Medicinal Chemistry by Gao and
Peet [159]. Therefore, this contribution will solely focus on
nonpeptide NK antagonists.
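
The shared C-terminal motif Phe-Xaa-Gly-Leu-Met-NH2 can be checked mechanically against the three peptides. The sketch below is illustrative only: the one-letter sequences are supplied from general knowledge rather than from this text (Table 1 is not reproduced here), the C-terminal amide is not encoded, and the names `TACHYKININS`, `C_TERMINAL_MOTIF` and `variable_residue` are assumptions made for the example.

```python
import re

# One-letter sequences of the mature peptides; supplied for illustration,
# not taken from this section. The C-terminal amide (-NH2) is not encoded.
TACHYKININS = {
    "Substance P": "RPKPQQFFGLM",
    "Neurokinin A": "HKTDSFVGLM",
    "Neurokinin B": "DMHDFFVGLM",
}

# Phe-Xaa-Gly-Leu-Met anchored at the C-terminus; "." is the variable Xaa.
C_TERMINAL_MOTIF = re.compile(r"F.GLM$")


def variable_residue(sequence: str):
    """Return the Xaa residue of the C-terminal motif, or None if absent."""
    match = C_TERMINAL_MOTIF.search(sequence)
    if match is None:
        return None
    return sequence[match.start() + 1]


for name, seq in TACHYKININS.items():
    print(f"{name}: Xaa = {variable_residue(seq)}")
```

Under these assumed sequences, Xaa is the only motif position that differs between the peptides (Phe in Substance P, Val in neurokinin A and B); the N-terminal portions differ throughout.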



Lead Finding 
NK1 Antagonists

The quinuclidine-based analogue CP-96,345 61 (Fig.
(22)) was developed from a lead structure found by random
screening and is the first nonpeptide NK1-selective
antagonist, showing an IC50 value of 0.77 nM (lymphoblast
IM-9 cells) [160]. Over the last years, CP-96,345 61 evolved
as the main pharmacological tool in the area of NK receptor
research.

A second series of piperidine-containing analogues
developed at Pfizer includes CP-99,994 62 [161] and CP-
122,721 63 (Fig. (22)) [162]. CP-99,994 62 exhibits
analgesic efficacy [163] and shows less in vivo inhibition of
NK1 receptor-mediated responses compared to the 5-
trifluoromethoxy analogue, CP-122,721 63 [164]. The latter
congener shows improved antiemetic properties in acute
cisplatin-induced vomiting in tumor patients when
administered in combination with a 5-HT3 antagonist [157].

Fig. (22). Quinuclidine-, piperidine-, and morpholine-derived NK1 antagonists.

Fig. (23). Spiro-aryl piperidine-type NK1 antagonists.

Based on the piperidine core structure of CP-99,994 62,
Merck synthesized L-733,060 64 (IC50=0.87 nM in CHO
cells) [165] which, after modifications, led to the
metabolically more stable L-754,030 65 (IC50=0.1 nM in
CHO cells) (Fig. (22)) [166]. Recent results indicate that L-
754,030 65 prevents cisplatin-induced emesis in patients
receiving anticancer chemotherapy [167,168].

Glaxo disclosed the 5-tetrazolyl-substituted analogue
GR-203,040 66 (Fig. (22)), retaining the piperidine core
structure of CP-99,994 62, as NK1 antagonist (GR-203,040
66: pKi=10.3 in CHO cells), which was selected for
clinical evaluation in emesis and migraine [169,170].
Further modification revealed GR-205,171 67 (Fig. (22))
(pKi=10.6 in CHO cells) which, apart from oral
bioavailability, also exhibits reduced L-type calcium channel
activity, a side effect associated with e.g. CP-122,721 63.
GR-203,040 66 ameliorates tissue damage induced by X-
irradiation or cisplatin [171,172].
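
The values 10.3 and 10.6 above are most plausibly read as pKi values, i.e. the negative decadic logarithm of Ki in mol/L; this reading is an assumption, since the scanned source is garbled at this point. Under that assumption, the conversion to the nM scale used elsewhere in this section can be sketched as follows (the function names are illustrative):

```python
import math


def pki_to_ki_nm(pki: float) -> float:
    """Convert pKi (-log10 of Ki in mol/L) to Ki in nM.

    1 mol/L = 1e9 nM, hence Ki[nM] = 10 ** (9 - pKi).
    """
    return 10 ** (9 - pki)


def ki_nm_to_pki(ki_nm: float) -> float:
    """Convert Ki in nM back to pKi."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    return 9 - math.log10(ki_nm)


# Values as read from the text (assumed to be pKi).
for label, pki in (("GR-203,040", 10.3), ("GR-205,171", 10.6)):
    print(f"{label}: pKi {pki} corresponds to Ki of about {pki_to_ki_nm(pki):.3f} nM")
```

On this reading both compounds would bind in the range of tens of picomolar (about 0.05 nM and 0.025 nM). These figures are a unit conversion of the quoted values, not independently sourced data.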

Novartis developed CGP-49,823 68 (Fig. (22)), based on
the piperidine scaffold, for anxiety-related indications [173].
CGP-49,823 68 (IC50=12 nM, bovine retina) has been tested
for its antagonistic potential against the depolarization of
spinal motoneurons by bath application of the selective
tachykinin receptor agonist septide (6-11), exhibiting an IC50
value of 0.3 µM (gerbil preparations) and 7.8 µM (rat
preparations) [174].

The central piperidine unit is also found in the Sanofi
compound SR-140,333 69 (Fig. (22)) (IC50=0.01 nM in
IM-9 cells), also termed Nolpitantium, which emerged from
a random screening approach followed by a lead optimization
program [175].

Investigations on the effects of SR-140,333 69 on
nociceptive pathways in rats revealed this agent to be a
potent drug for pain relief [176]. Kubota et al. reported on
the synthesis of spiro-piperidines as NK1 receptor
antagonists [177]. SAR studies starting from the primary
lead YM-35375 70 (dual NK1/NK2 antagonist) (Fig. (23))
yielded analogue YM-35384 71 as a selective NK1
antagonist which was 12-fold more potent compared to the
original spiro[isobenzofuran-1(3H),4'-piperidine] YM-35375
70. YM-35384 71 already showed an IC50 value of 58 nM,
which could be improved by further modification resulting in
compound YM-49244 72 (Fig. (23)), a spiro-substituted
piperidinium salt with an IC50 value of 1.9 nM against SP-
induced contraction in guinea pig ileum and inhibitory
activity against selective NK1 receptor agonist-induced
bronchoconstriction in guinea pigs (ID50=24 µg/kg, i.v.)
[177].

A further class of spiro-aryl piperidines is represented by
Merck Sharp and Dohme's spirocyclic aryl sulfonamides,
serine-derived NK1 antagonists [178]. Compound 73 (Fig.
(23)) exhibits an IC50 value of 1.0 nM for the displacement
of [125I]SP from NK1 receptors in CHO cells and served for
the development of a pharmacophore model for the receptor
binding requirements [179].

Fig. (24). Lanepitant disclosed by Eli Lilly.

Eli Lilly has identified the tryptophan-derived LY-303,870
74 (Fig. (24)) as a selective antagonist binding to
NK1 with high affinity, while lacking ion channel activity
[180]. LY-303,870, Lanepitant 74, is a candidate for clinical
development in animal models of inflammation, pain,
migraine, and asthma [158].




Fig. (25). Perhydroisoindole-type NK1 antagonists.

RP-67,580 75 (Fig. (25)) emerged after lead optimization
of an initial screening hit of Rhone-Poulenc Rorer's
compound stock. RP-67,580 75 belongs to a class of
substituted perhydroisoindoles which, apart from poor oral
bioavailability, also suffered from L-type calcium channel
interaction [151,181]. The follow-up compound RPR-100,893
76 (Fig. (25)), Dapitant, exhibits superior binding
affinity (IC50 = 13 nM, IM-9 cells) [182].

Investigations of the axially chiral
1,7-naphthyridine-6-carboxamide 77 (Fig. (26)) revealed that
the atropisomer (aR)-trans-77 represents the bioactive
receptor-bound conformation of this potent NK1 antagonist
[183]. In vitro, this analogue inhibits
[125I]Bolton-Hunter(BH)-SP binding in human lymphoblast
cells (IM-9) with an IC50 value of 0.24 nM. Further, it shows
in vivo potency by inhibiting capsaicin-induced plasma
extravasation in the trachea of guinea pigs upon i.v. and
p.o. administration.

Fig. (27). Dual NK1/NK2 antagonists.

Based on this template, Natsugari et al. [183] developed
TAK-637 78 (Fig. (26)), the (aR,9R)-atropisomer of a cyclic
naphthyridine analogue. TAK-637 78 exhibits an IC50 value
of 0.45 nM and ID50 values of 4.3 µg/kg and 33 µg/kg after
i.v. and p.o. administration, respectively. Further, it
increased the shutdown time of distension-induced bladder
contractions and the bladder volume threshold in guinea pigs,
thus implying its clinical potential in the treatment of
pollakiuria and urinary incontinence [183]. The x-ray
structures of 77 and 78 provide insights into the prerequisite
structural requirements for NK1 receptor binding, thereby
assigning the (aR,9R)-isomer as the active conformation [183].

Fig. (26). Naphthyridine-type NK1 antagonists.

Fig. (28). NK2 antagonists.

Since the release of SP and NKA causes mucus secretion,
airway constriction, and plasma extravasation - typical
clinical symptoms of asthma - dual NK1/NK2 antagonists
have been suggested for the treatment of asthma [184].

Considering the structural requirements of Sanofi's
NK2-selective antagonist SR-48968 82 (Fig. (28), see below),
researchers at Yamanouchi Pharm. developed the
spiro[isobenzofuran]piperidine YM-35375 70 (Fig. (23)),
which binds to the NK2 receptor with an IC50 value of
84 nM and to NK1 with an IC50 value of 710 nM.
Further, it shows inhibitory activity (ID50 = 41 µg/kg, i.v.)
against [β-Ala8]NKA(4-10)-induced
bronchoconstriction in guinea pigs [185]. Utilizing this new
NK1/NK2 dual antagonist as lead compound, a further
spiro-substituted piperidine analogue, YM-44778 79 (Fig. (27)),
was developed, exhibiting potent antagonistic activities
against the NK1 (IC50 = 82 nM) and NK2 (IC50 = 62 nM)
receptors in isolated tissues [185].

Based on L-tryptophan benzyl esters, Qi et al. reported
on the synthesis of two compounds, 80 and 81, with dual
NK1/NK2 receptor affinity (Fig. (27)) [186].

Compound 80 contains a 4-spiroindano piperidine and shows
dual NK activity combined with slightly improved NK2 activity
(IC50 = 56 nM (hNK1), IC50 = 27 nM (hNK2)). Upon
incorporation of a 4-spiroindoline sulfonamide, the balanced
antagonist 81 was obtained (IC50 = 14 nM for NK1; 24 nM for
NK2).

NK2 Antagonists

NK2 antagonists are of particular interest for the treatment
of chronic diseases such as asthma, inflammatory bowel 
disorders, rheumatoid arthritis, pain, emesis, and psychiatric 
disorders [157]. 

The first NK2 antagonist, SR-48,968 82 (Fig. (28)),
Saredutant, was described in 1992 [187]. This potent
antagonist has been shown to inhibit NKA-induced
bronchoconstriction in isolated human airways. Only recently,
a study by van Schoor et al. demonstrated that NKA-induced
bronchoconstriction in asthmatics was significantly
reduced with 100 mg Saredutant administered p.o. [188].

Based on this prototype compound, a number of
analogues emerged from different laboratories. SR-144,190
83 (Fig. (28)) retains the phenylpiperidine moiety but
contains an additional morpholine unit in order to introduce
rigidity. Compared to the parent compound, it exhibits a
similar pharmacological profile with increased bioavailability
in the CNS [189].

Yamanouchi (YM-38336 84) and Zeneca (ZD-7944 85)
(Fig. (28)) also presented potent NK2 antagonists based on the
Sanofi lead structure (SR-48,968 82). ZD-7944 85 [190],
showing a Ki value of 0.14 nM (MEL cells), still retains the
phenylpiperidine entity, while YM-38336 84 [191] has been
modified by introduction of a spiro-benzothiophene residue
in position 4 of the piperidine. YM-38336 84 shows potent
NK2 inhibitory activity against [β-Ala8]NKA(4-10)-induced
bronchoconstriction in guinea pigs, demonstrated by an ID50
value of 20 mg/kg, i.v. [192].

Harrison et al. reported on the development of selective
NK2 and NK3 antagonists based on a common structural
template, notably the NK3-selective compound SR-142,801
91 (Fig. (29), see below) [193]. Transfer of the carbonyl
oxygen from an exocyclic to an endocyclic position on the
piperidine ring led to two series of selective analogues, NK2
and NK3 antagonists, respectively [193]. An example of a
potent NK2 antagonist is given by compound 86 (Fig. (28)),
which exhibits an IC50 value of 2.2 nM for the displacement
of [125I]NKA from the cloned human NK2 receptor in CHO
cells.

Fig. (29). NK3 antagonists.

A number of preclinical nonpeptide NK2 antagonists have
been reported by Glaxo Wellcome, Rhone-Poulenc Rorer and
Zeneca, e.g. GR-159,897 87, RPR-106,145 88 (related to
the NK1 antagonist RPR-100,893 76 (Fig. (25))), and
ZM-253,270 89 (Fig. (28)) [158], respectively.

Menarini used an interestingly rigid template for its
selective NK2 antagonist (Ki = 2.5 nM) MEN-11420 90,
Nepadutant, exhibiting improved in vivo potency and
duration which is attributed to its rigid structure [194].

The first selective nonpeptide NK3 antagonist,
SR-142,801 91, Osanetant, has been reported by Sanofi
(Ki = 0.21 nM, CHO cells) (Fig. (29)) [195].

Based on this structural template, Merck Sharp and 
Dohme elaborated a series of NK2 and NK3 antagonists,
exemplified with analogue 92 (Fig. (29)), the corresponding 
congener of 86 (Fig. (28)). 

SmithKline Beecham claimed NK3 antagonists for the
treatment of CNS diseases, pulmonary disorders and
dermatitis [196]. Based on a quinoline core structure,
Giardina et al. developed SB-223,412 93 (Fig. (29)),
demonstrating high NK3 activity (IC50 and Ki in the low
nanomolar range, CHO cells), weak NK2 activity, and no
affinity for other receptors including ion channels [197].
SB-223,412 93 exhibits in vitro and in vivo oral and
intravenous activity in animal models [198].

An entirely novel structure, 94 (Fig. (29)), has been
claimed as NK3 antagonist for the treatment of bronchitis,
asthma, anxiety, Parkinson's disease and dermatitis [199].
Interestingly, this compound strongly resembles the indane
carboxylic acids of SmithKline Beecham's ET antagonists.

Fig. (30). Miscellaneous Y1 antagonists.

Neuropeptide Y

Biomedical Significance 

The 36-amino acid peptide neuropeptide Y (NPY, Table
I) was discovered in 1982 by Tatemoto et al. [200]. NPY is
a member of the pancreatic polypeptide family, which also
includes the structurally related peptide YY (PYY) and
pancreatic polypeptide (PP) [201]. NPY is widely distributed
throughout the mammalian central and peripheral nervous
system [202,203]. Interacting with its at least six receptor
subtypes (Y1-Y6), it is involved in numerous physiological
functions, e.g. food intake, blood pressure regulation,
hormone secretion, sexual behaviour, and circadian rhythm
[204-209]. Patent literature issued over the last ten years
concentrates mainly on the inhibition of receptor-ligand
interactions by low-molecular weight compounds in order to
therapeutically interfere in mechanisms such as anxiety,
appetite stimulation, obesity, alcohol intake, hypertension,
and regulation of coronary tone [210]. As the Y1 and Y5
receptors are suggested to control feeding behaviour, they are
believed to be the best target systems for developing
antagonists as therapeutics for the treatment of obesity
[204,211-213]. The Y1 receptor, found in the peripheral and
in the central nervous system (CNS), was cloned in
1992 [214]. Its modulation may influence numerous
physiological conditions including anxiety, diabetes,
obesity, or appetite disorders. Most recently, the Y5 receptor
has been cloned and characterized to be involved in food
intake regulation [212]. A review published by Ling in 1999
reports on the patent situation related to NPY antagonists
[210]. In this contribution representative examples of
potentially active nonpeptide NPY antagonists will be
described according to their target receptors.

Y1 Receptor Antagonists

A number of Y1 antagonists (Fig. (30)) published over
the last ten years show binding affinities in the nanomolar
range, e.g. BIBP3226 95 (Ki = 7.2 nM), SR120819 96
(Ki = 15 nM), PD160170 97 (Ki = 48 nM), and LY-357897 98
(Ki = 0.75 nM) [215-218]. The best characterized
Y1 antagonist, BIBP3226 95, has been demonstrated to
inhibit NPY-mediated vasoconstriction and pressure
variations [215]. SR120819 96 represents a dipeptide
analogue containing a sulfonamide. This orally active
antagonist incorporating a central arginine mimic
(benzamidine in 96) develops its potency in the
1,4-cis-disubstituted cyclohexyl ring by antagonizing
NPY-mediated pressure responses [219].

Parke-Davis discovered a new and unique class of
moderately potent but selective Y1 antagonists by random
screening, of which PD160170 97 is a representative
compound. Eli Lilly described LY-357897 98 from a series
of trisubstituted indoles and benzimidazoles. Compound 99
(Fig. (30)) [220], showing a Ki value of 2.1 µM, was
discovered by a biased screening of the in-house library and
served as lead structure in the subsequent SAR studies of the
trisubstituted indole series. Further structure
modification led to 98, the most active analogue (Ki = 0.75
nM), which, in the (S)-configuration, inhibits NPY-induced
forskolin-stimulated cAMP release and intracellular Ca2+
release in the nanomolar range. The corresponding
benzimidazole series has also been investigated [221]. A
representative example is given by compound 100 (Fig.
(30)), which was obtained after systematic optimization of the
N1- and C4-substituents of the benzimidazole scaffold.
Compound 100 exhibits in vitro binding affinity on AV-12
cells expressing the human Y1 receptor with a Ki value of
1.7 nM.

Fig. (31). Benzazepinone-type Y1 antagonists.

Pfizer claimed a series of piperazinyl-comprising
compounds as Y1-selective antagonists [222]. Analogue 101
(Fig. (30)) demonstrates an interesting activity profile, with
the two conformers behaving differently, i.e. cis-
(IC50 = 76 nM) and trans- (IC50 = 525 nM) exposed
ethyl substituent with respect to the phenylpiperazine
substituent of the cyclohexyl ring.

Warner Lambert filed compounds based on a quinoline
scaffold that were claimed as Y1 subtype-selective
antagonists. The 6-aryl-sulfonyl-quinoline analogue 102
(Fig. (30)) inhibits [125I]PYY binding to the human Y1
receptor with a Ki value of 48 nM [223].

Alanex Corp. claimed two series of compounds containing
either an amidino-urea or a diamidino-urea core structure. A
representative of the latter series is given by 103 (Fig. (30)),
inhibiting the binding of [125I]PYY to the Y1 receptor in
membranes derived from human neuroblastoma cell lines
(SK-N-MC) with an IC50 value of 70 nM [224].

Bristol Myers Squibb's patents cover two structurally
related compound classes, i.e. phenyl-dihydropyridines [225] 
and phenyl-dihydropyrimidines [226]. In compound 104 
(Fig. (30)) the ^-substituted phenyl-dihydropyridine 
sidechain is terminated with a spiroindane, a structural 
element which is also found among other antagonists 
directed against numerous members of the peptide-binding 
GPCR superfamily. 

Murakami et al. [227] at Shionogi published a novel
class of 1,3-disubstituted benzazepinones as potent and
selective Y1 antagonists. Based on the lead compound 105
(Fig. (31)) (Ki = 1.5 µM), which emerged from a random
screening approach, follow-up compounds 106 (Ki = 160 nM)
and 107 (Ki = 39 nM) have been obtained (Fig. (31)).

Fig. (32). Y5 antagonists.

Marion Gurrath

Further optimization of the phenyl substituent in
position 3 leading to analogue 108, as well as optimization
of the substituent in position 3 of the
2,3,4,5-tetrahydro-1H-1-benzazepin-2-one, represented by
congener 109 (Fig. (31)), resulted in an increase of the
binding affinity towards 43 nM and 2.9 nM, respectively.
Combination of the optimized structural features led to one
of the most potent derivatives (110, Fig. (31)), which
competitively inhibits specific [125I]PYY binding to Y1
receptors in human SK-N-MC cells with a Ki value of 5.1 nM.
Although 110 also antagonizes the Y1 receptor-mediated
increase in cytosolic free Ca2+ concentration in SK-N-MC
cells, it has not been evaluated in vivo because of its poor
solubility in aqueous solution and poor oral bioavailability.
Nevertheless, it has been shown in binding assays with 17
receptors, including the Y2, Y4, and Y5 receptors, that it
binds selectively to the Y1 receptor [227].

Y5 Receptor Antagonists

Several patent applications were filed by Novartis in
1997 [228-230] claiming diamino quinazolines as selective
Y5 antagonists. They were shown to inhibit NPY-induced
Ca2+ increase in stably transfected cells expressing the Y5
receptor. Analogue 111 (Fig. (32)) decreases food intake by
60% in 24 h food-deprived rats after i.p. administration of 30
mg/kg.

In 1998 Banyu Pharm. [231,232] and Bayer [233] filed
patents including aminopyrazoles, aminopyridines and an
amide-based core structure as Y5 antagonists. The Banyu
compounds 112 and 113 showed IC50 values for Y5 binding
of 8.3 nM and 4.1 nM, respectively [234], whereas the Bayer
compound 114 binds with an IC50 value of 0.47 nM. This
congener also shows selective affinity for the Y5 receptor
compared to the Y1, Y2, or Y4 receptor subtypes (Fig. (32))
[234].

STRUCTURE-BASED DRUG DESIGN

After having addressed the classical lead finding approach 
characterized by screening compound libraries with 
subsequent optimization, the complementary strategy of 
structure-based design will be highlighted, since this 
strategy is about to change the classical paradigm of "random 
versus rational" in favour of "random goes rational". Because
no high-resolution structure of any GPCR
protein is available, all design attempts are still restricted to
comparative analyses of structural features of biologically 
characterized low-molecular weight compounds which are 
interpreted in terms of steric and physicochemical 
complementarity to a hypothetical receptor binding site. 
Currently pursued GPCR research projects represent 
textbook examples for the fruitful combination of ligand- 
derived rationales that are incorporated into e.g. the design of 
combinatorial chemistry programs with the aim to direct 
resulting libraries more efficiently to the target class of 
interest, rather than attempting to explore systematically the 
infinite universe of molecular diversity. In the following, a 
few representative research efforts will be introduced that 
clearly attempt to change the mainstream of classical lead 
finding programs in favour of knowledge-based approaches. 



Somatostatin 

Somatostatin (Somatotropin Release-Inhibiting Factor, 
SRIF) (Table 1) was discovered because of its inhibitory 
effect on growth hormone secretion. The peptide hormone 
which exists in two biologically active forms, the 14 amino 
acid form (SRIF-14) and the 28 amino acid form (SRIF-28), 
acts as a neuromodulator [235]. 



Five receptor subtypes for somatostatin (sst1-sst5) have
been cloned and characterized from human tissue [236].
Apart from its pivotal role as neuromodulator within the
central nervous system (CNS), somatostatin alters the
secretion of growth hormone (GH), insulin, glucagon,
pancreatic enzymes, and gastric acid [237-240].
Consequently, analogues of somatostatin emerged as
interesting tools in the treatment of disorders linked to the
above mentioned physiological functions. Somatostatin
agonists may therefore be used for the treatment of
acromegaly, diabetes, cancer, rheumatoid arthritis, and
Alzheimer's disease. In particular, sst2-selective agonists
emerged as useful candidates for the treatment of acromegaly,
retinopathy, and diabetes [241,242].

The area of somatostatin agonist and antagonist research 
is a textbook example for indirect drug design utilizing 
ligand-derived structural rationales for design purposes. In 
the beginning of the 1990s numerous design projects were
pursued that aimed to replace the peptide scaffold of the
pharmacophoric portion of somatostatin (SRIF-14), yielding
a variety of moderately active, chemically diverse
compounds. More recent lead finding programs employ the
highly efficient technology of combinatorial chemistry for
rapid modification of promising hits, culminating in
subtype-selective high-affinity binding compounds from a
series of designed libraries. A brief overview of both the
rational design of single somatostatin-based peptidomimetics
and the combinatorial chemistry-based approaches for lead
identification and optimization will be given after a short
description of the somatostatin-relevant pharmacophore 
hypothesis. 

The tetradecapeptide SRIF-14 115 (Fig. (33)), one of the
widely distributed active forms of somatostatin, is believed
to adopt a two-stranded β-sheet conformation induced by a
β-turn encompassing Phe7-Trp8-Lys9-Thr10 and the disulfide
bridge between Cys3 and Cys14 (Fig. (33)).
The conformation is further stabilized by the transannular
H-bonding pattern typical for antiparallel sheet structures.
From numerous sequence- and structure-activity studies it
turned out that the primary pharmacophore consists of the
β-turn-forming residues Phe7-Trp8-Lys9 and an additional
lipophilic binding element reminiscent of Phe6/Phe11 [243].

Fig. (34). Peptide conformation-derived non-peptide somatostatin antagonists. The numbering scheme refers to that of SRIF-14 (see
Fig. (33)).

Fig. (35). High-affinity hsst2 antagonists derived by screening and subsequent optimization.




The experimentally derived conformations of the
metabolically more stable peptide analogues, e.g. octreotide
(Sandostatin®) 116 [244,245] or L-363,377 117 [246,247],
not only support the pharmacophore hypothesis, but were
also used as template structures underlying a series of
rational design attempts (Fig. (33)). In 1992, researchers at
Sandoz designed a tetra-substituted xylofuranose derivative
118 (Fig. (34)) positioning the sidechains of Phe7-Trp8-Lys9
at its C-2, C-3, and C-5 atoms, while the benzyloxy group
attached to C-3 resembles the aromatic sidechain of Phe11
(Fig. (34)) [248].

The xylose derivative 118 displaced radio-labelled
octreotide 116 from its receptor with an IC50 of 23 µM.
Even though the mutual steric fit of the xylose-based mimic
and the somatostatin structure was reasonable, the
compounds displayed only moderate affinity, which was
attributed to the loss of considerable conformational entropy
during receptor binding. Consequently, the design strategy
at Sandoz was directed towards more rigid compounds based
on nonpeptide scaffolds. For the purpose of substituting the
peptide backbone of SRIF-14 within the β-turn portion, the
privileged structure of the 1,4-benzodiazepinone was
employed, from which the pharmacophoric groups could
radiate into the periphery [249]. The resulting nonpeptide
tetrapeptide-mimetic 119 (Fig. (34)) was designed to account
for the sidechains of Phe7-Trp8-Lys9 by the appropriate
substituents, while the aromatic ring of the benzodiazepine
core was believed to mimic the additional lipophilic element
referring to Phe6/Phe11. However, the racemic
mixture of 119 (benzodiazepinone) showed an IC50 of 7 µM,
and even after separation, the L- and D-Trp containing
benzodiazepinones displaced the radioligand with IC50 values
of only 6.5 µM and 8.2 µM, respectively.

Fig. (36). Side-by-side stereo presentation of the structural overlay of 123 (ball-and-stick mode) onto the experimentally-derived
conformation of 117 (stick-mode).

Similar affinities in the low micromolar range were
obtained with peptidomimetics based on β-D-glucose
scaffolding described by Hirschmann and Nicolaou at the end
of the 80's and beginning of the 90's [250]. Molecular
modeling studies carried out on the 3D structures of SRIF-14
115 and analogues of L-363,377 117 suggested that
substituents at C-2, C-1, and C-6 of a β-D-glucose template
resemble the orientational pattern of the β-turn-forming
amino acids of the somatostatin-derived peptides. The
corresponding penta-substituted glucose 120 (Fig. (34))
showed an IC50 of 15 nM.

In 1996, researchers from Rhone-Poulenc Rorer published
a similar approach of de-novo designed peptidomimetics
employing aza-sugar-based templates for the spatially
controlled orientation of the pharmacophoric amino acid
sidechains [251]. Independent of ring size and substitution
pattern, all analogues showed weak affinity with IC50 values
in the range of 10-15 µM (see for example 121, Fig. (34)).

Fig. (37). For each somatostatin receptor subtype (hsst1-hsst5) highly selective compounds emerged from rationally designed
combinatorial libraries. Panel labels: L-799,976 (Ki = 0.05 nM, h-sst2 selective); L-803,087 (Ki = 0.7 nM, h-sst4 selective); L-817,818 (Ki = 0.4 nM, h-sst5 selective).

Over the last two years, scientists at the Merck Research
Laboratories conducted a comprehensive program aimed at
identifying subtype-selective peptidomimetic compounds for
each somatostatin receptor subtype (sst1-sst5) by following a
rational design strategy using a combination of classical
medicinal chemistry with modern combinatorial chemistry
techniques [252-256]. The primary lead, L-264,930 122
(Fig. (35)), that initiated that combined approach was
identified by a virtual screening of the Merck sample
collection. The 3D structure of the cyclic hexapeptide
L-363,377 117 (Fig. (33)) served as spatial probe in that a
geometric pattern, describing the arrangement of the
pharmacophore groups, was derived by means of molecular
modeling. After similarity searches, in which the sidechains
of residues Tyr7-Trp8-Lys9 were given priority for the
pharmacophore definition, L-264,930 122 was uncovered
with submicromolar affinity for the hsst2 receptor.
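
The idea of using a geometric pattern of pharmacophore groups as a
search probe can be illustrated with a distance-based comparison. The
following is a minimal sketch only, not the procedure used at Merck:
the feature names, all coordinates, and the 1.0 Å tolerance are
hypothetical values chosen for illustration, and a real search would
also have to handle conformational flexibility.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float, float]
Pharmacophore = Dict[str, Point]  # feature name -> 3D position in angstrom


def pairwise_distances(features: Pharmacophore) -> Dict[Tuple[str, str], float]:
    """Return the distance between every pair of features, keyed by sorted name pair."""
    return {
        (a, b): math.dist(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }


def matches_query(query: Pharmacophore, candidate: Pharmacophore,
                  tolerance: float = 1.0) -> bool:
    """True if the candidate has every query feature and reproduces all
    inter-feature distances of the query within the given tolerance."""
    if set(query) - set(candidate):
        return False  # a required feature is missing
    query_d = pairwise_distances(query)
    cand_d = pairwise_distances({name: candidate[name] for name in query})
    return all(abs(query_d[pair] - cand_d[pair]) <= tolerance for pair in query_d)


# Hypothetical probe: three side-chain features placed at invented positions.
query = {"aromatic": (0.0, 0.0, 0.0), "indole": (5.0, 0.0, 0.0), "amine": (5.0, 6.0, 0.0)}

# Hypothetical library entries (extra features are ignored).
hit = {"aromatic": (1.0, 1.0, 1.0), "indole": (6.2, 1.0, 1.0),
       "amine": (6.0, 7.3, 1.0), "hydroxyl": (2.0, 2.0, 2.0)}
miss = {"aromatic": (0.0, 0.0, 0.0), "indole": (9.0, 0.0, 0.0), "amine": (9.0, 2.0, 0.0)}

print(matches_query(query, hit))   # True
print(matches_query(query, miss))  # False
```

Comparing inter-feature distances rather than raw coordinates makes
the test independent of how each molecule is positioned in space,
which is why the translated entry still matches the probe.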

This compound became the primary focus for medicinal 
chemistry and combinatorial chemistry at Merck. By 
constraining the floppy diamine chain with a 1,3-bis- 
aminomethyl-cyclohexane moiety the compound was 
optimized to yield L-054,264 123 (Fig. (35), Fig. (36)) with 
an IC50 of 1.6 nM for the hsst2 receptor and a more than
1000-fold selectivity over all other somatostatin receptor 
subtypes. 

Simultaneously, L-264,930 122 served as lead structure
for a targeted combinatorial library. For library design the
lead was dissected into three components, notably the central
α-amino acid, the C-terminal blocking diamine, and the
N-terminal blocking bulky urea-attached amine. The initial
library was based on 20 α-amino acids that were mainly
analogues of Trp or carried modified aromatic sidechains.
Additionally, 20 diamines were chosen in which the spacing
between the two nitrogens varies between four and six
atoms, also encompassing different ring topologies. The
amine collection comprised 79 different entities that were
biased towards piperidines and piperazines containing
additional aromatic rings, so-called "privileged structures".
A solid-phase mix-and-split protocol was used to synthesize
more than 130000 compounds in complex mixtures that
demanded a deconvolution strategy. After several rounds of
iterative optimization employing classical analoging as well
as follow-up libraries, five compounds 124 - 128 emerged
with the desired activity and selectivity profile, in that each
compound is highly selective for a distinct somatostatin
receptor subtype (Fig. (37)).
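
The combinatorial principle behind such a three-component library can
be sketched as a Cartesian product of the building-block pools. The
sketch below uses the pool sizes quoted above with placeholder labels
(AA01, DA01, AM01, ... are hypothetical identifiers, not actual
building blocks). The nominal product of the three pool sizes is a
lower figure than the reported total of more than 130000 compounds;
the text does not state how that total was counted, so the sketch
makes no attempt to reproduce it.

```python
from itertools import islice, product

# Placeholder labels; only the pool sizes (20, 20, 79) come from the text.
amino_acids = [f"AA{i:02d}" for i in range(1, 21)]  # central alpha-amino acids
diamines = [f"DA{i:02d}" for i in range(1, 21)]     # C-terminal diamines
amines = [f"AM{i:02d}" for i in range(1, 80)]       # N-terminal urea-attached amines


def enumerate_library(*pools):
    """Lazily yield every combination of one building block per pool."""
    return product(*pools)


nominal_size = len(amines) * len(amino_acids) * len(diamines)
first_members = list(islice(enumerate_library(amines, amino_acids, diamines), 3))

print(nominal_size)   # 31600
print(first_members)  # first three (amine, amino acid, diamine) combinations
```

Because the enumeration is lazy, the full product never has to be held
in memory, which mirrors the fact that mix-and-split synthesis yields
mixtures rather than individually catalogued compounds.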

This program impressively demonstrates the impact of an 
intelligent combination of structural rationales derived by 
comprehensive molecular modeling with the synthetic 
efficiency of current combinatorial chemistry techniques for 
lead finding attempts within modern medicinal chemistry. 

A further example of a peptidomimetics-based library
employing structure rationales for identification of
subtype-selective somatostatin analogues was published recently
by J. Ellman and co-workers (Fig. (38)) [257]. By decoration of
a medium-sized heterocyclic β-turn mimic with the Trp and
Lys sidechains in positions i+1 and i+2 and vice versa, together
with an additional amine building block in i+3, a
remarkably small library of only 172 entities (22 amines,
D/L-Trp-D/L-Lys, D/L-Lys-D/L-Trp) uncovered the
hsst5-selective compound 129 with an IC50 of 87 nM.



Bradykinin

Researchers at Sterling Winthrop considered
angiotensin-converting-enzyme (ACE) inhibitors as templates for
the design of BK B2 receptor antagonists [258], since ACE
degrades both angiotensin II (AII) and BK by cleaving the
Pro7-Phe8 amide bond. Therefore, an ACE inhibitor was
considered to display properties or conformational
similarities to BK, thus establishing a pharmacophore link
between ACE and BK receptors in that both macromolecules
recognize similar steric and physicochemical features. In
order to test this hypothesis, the ACE inhibitor Quinapril
130 (Fig. (39)) [259] was chosen as template for the design
and synthesis of a series of homoPhe-Tic (Tic:
tetrahydroisoquinoline) containing compounds. The
diastereomers of 131 (Fig. (39)) exhibit binding affinities in
the micromolar range (Ki = 1 µM) in [3H]BK binding
studies with human IMR-90 fetal lung fibroblasts.

Fig. (38). Left: hsst5-selective compound derived from a β-turn-templated library; right: side-by-side stereo presentation of the
structural superposition of the β-turn mimic (ball-and-stick mode) onto the βII' turn portion of 117 (stick-mode).

Fig. (39). Quinapril (130) served as template for the design of
BK antagonists (e.g. 131).

Goodfellow et al. [260] followed a different approach in
that they established a library based on a β-turn template,
CP-0597 132 (Fig. (40)) [261], which is a peptidic B1/B2
antagonist containing D-Tic and N-Chg (Chg:
N-cyclohexylglycine) in the i+1 and i+2 positions of a βII' turn.
Starting from that structural rationale, the peptidomimetic
CP-2055 133 (Fig. (40)) was generated. Based on the
1,4-piperazine scaffold, a combinatorial library has been
designed to produce approximately 2500 rationally directed
diverse analogues (RDDA), 134 (Fig. (40)).

This process led to the discovery of nonpeptide B2
antagonists serving as lead compounds for traditional
optimization. While the parent peptidic analogue CP-0597
132 shows an IC50 value of 0.33 nM, CP-2055 133 exhibits
an IC50 value of about 55 µM on a cloned human B2
receptor. CP-2458 is a further member of the designed
library 134 and inhibits human B2 receptor binding
(IC50 = 4.1 µM) and BK-stimulated Ca2+ flux in human
fibroblasts (IC50 = 19 µM). Unfortunately, the chemical
formula of the compound is not given explicitly in the
publication.

Based on two structural templates, (i) a cyclic hexapeptide
BK antagonist 135 [262] and (ii) the nonpeptide BK
antagonist WIN-64338 43 (Fig. (41)) [129], Dankwardt et
al. [263] designed nonpeptide B2 antagonists. While the
hexapeptide served as structural template for the positioning
of relevant functionality, WIN-64338 43 served as rigid
scaffold for the design of a series of naphthylalanine
containing derivatives, none of which showed improved
affinity for the B2 receptor when compared to WIN-64338 43
(Ki = 44 nM). Replacement of the phosphonium group
by the corresponding ammonium moiety resulted in a
two-fold decrease in affinity for the B2 receptor. However, the
proposed structural superposition of the cyclic hexapeptide
135 with the blocked amino acid derivative 43 provided a
pharmacophore hypothesis that enabled Dankwardt and
coworkers to design moderately active compounds and
might serve as structural blueprint for further design
attempts [263].



Neurokinin 

The structural feature of a reverse β-turn has emerged as a
general design principle underlying a variety of GPCR
antagonist projects. β-turns play an important role in
recognition phenomena, as documented e.g. for somatostatin
and NKA, which bind to their receptors in a proposed β-turn
conformation. Therefore Horwell et al. [264] at Parke-Davis
decided to employ β-turn mimetics for the design of
compounds with affinity for the NK2 receptor. Starting from
the x-ray structure of MEN-10627 136 (Fig. (42)) [265], a
cyclic hexapeptide displaying high NK2 affinity, a
pyrrolidine-based Trp-Phe dipeptide mimetic 137 has been
designed (Fig. (42)).

Fig. (40). Design strategy of BK antagonists following the "rationally directed diverse analogues" approach.

Fig. (41). Rationally designed BK antagonists.

The Trp-Phe dipeptide scaffold mimics the Trp-Phe
fragment in the central portion (i+1, i+2) of a β turn within
the cyclic hexapeptide, which folds into a βI/βII turn
conformation. Although the indole and benzyl sidechains of
both compounds superimpose satisfactorily, 137 did not show
significant NK2 receptor affinity. The lack of affinity has been
attributed to the misfit of the dipole moments of both
molecules. In order to address this problem in more detail, a
further Trp-Phe dipeptide mimetic 138 (Fig. (42)) has been
designed by computer-assisted molecular modeling,
identifying a 2-azabicyclonorbornan spacer to be more
favourable compared to the pyrrolidine (Fig. (43)).

Comparison of the binding affinities revealed that the
conversion of the hexapeptide to a dipeptide unit results in
the loss of high binding affinity (MEN-10627 136:
IC50 = 0.079 nM (NK2); 137: IC50 = 14% @ 10 µM (NK2);
138: IC50 = 31% @ 10 nM (NK2)), as studied by displacement
assays with [125I]NKA in hamster urinary bladder. On the
other hand, the [125I]BH-SP displacement from NK1 in human
IM-9 cells by MEN-10627 136 (IC50 = 0.8 µM) is retained by
137 and 138 with IC50 values of 3.7 µM and 6.7 µM,
respectively. Interestingly, the dipeptide mimetics exhibit
some binding affinity to human NK3 receptors stably
expressed in CHO cells, as shown by replacement of [125I]-
[MePhe7]NKB (137: IC50(NK3) = 3.5 µM; 138:
IC50(NK3) = 35% @ 10 µM), while the parent hexapeptide
exhibited no NK3 affinity at all.

Only recently, Porcelli et al. [266] presented the design
of an SP antagonist based on a cyclic pentapeptide with the
chirality sequence following a D1L2D3D4L5 pattern. The
authors suggest this scaffold as a generic template to design
antagonists also for other members of the GPCR family.
This suggestion is the logical consequence of the fact that
the same unique skeleton is found among other potent
antagonists for peptide-binding GPCRs, e.g. the natural
pentapeptide BE-18257B (cyclo-(D-allo-Ile-Leu-D-Trp-D-Glu-Ala-))
and its synthetic analogue BQ-123 (cyclo-(D-Val-Leu-D-Trp-D-Asp-
Pro-)) [267], a prominent ETA antagonist. Both cyclic
pentapeptides follow the chiral sequence pattern of DLDDL.
The solution structure of BQ-123 [268] exhibits a typical
βII/γi turn arrangement characteristic for this class of
molecules. Based on the same structural template, Porcelli et
al. designed an SP antagonist, ITF-1565 (cyclo-(D-Trp1-Pro2-
D-Lys3-D-Trp4-Phe5-)), which inhibits NK1-mediated SP-
induced contraction of the rabbit caval vein. ITF-1565 shows
only modest NK2 activity and was inactive in ETA assays.
ITF-1565 exhibits a βII/γ turn arrangement with Pro2 in the i+1
and D-Lys3 in the i+2 position of the β turn and Phe5 in the
central position of the γ turn. Interestingly, the authors
succeeded in superimposing the sidechain functionalities of D-
Trp4, Phe5 and D-Trp1 within ITF-1565 well onto the
indole and benzyl rings within a β-D-glucose derived SP
antagonist 139 (Fig. (44)).

Fig. (42). Peptide structure-derived rationales were used to design non-peptide NK antagonists (Dap: 2,3-diaminopropanoic acid).

Fig. (43). Side-by-side stereo presentation of the structural overlay of 138 (ball-and-stick mode) onto the x-ray structure of 136 (stick mode) within the turn corresponding portion.

Fig. (44). Glucose-based peptidomimetic NK analogue.



Luteinizing Hormone-Releasing Hormone 

The decapeptide amide Luteinizing Hormone-Releasing
Hormone (LHRH, Table 1) [269], pGlu-His-Trp-Ser-Tyr-
Gly-Leu-Arg-Pro-Gly-NH2, is released from the
hypothalamus and stimulates the anterior pituitary gland,
resulting in the secretion of the gonadotropins luteinizing
hormone (LH) and follicle-stimulating hormone (FSH).
LHRH, also termed gonadotropin-releasing hormone, plays
an important role in the regulation of reproductive functions,
thus rendering its synthetic analogues useful tools for the
treatment of endocrine-based diseases like prostate and breast
cancer, endometriosis, uterine leiomyoma, and precocious
puberty [270]. Even though LHRH agonists proved to be
useful in the treatment of the above-mentioned diseases [271-
273], research has also focused on the development of potent
and safe antagonists.

Recently, Takeda presented a substituted 4-
oxothieno[2,3-b]pyridine as a highly potent and orally active
nonpeptide antagonist of the human LHRH receptor [274].
Again, this research program was based on the structural
characteristics of a β turn suggested as the dominant
conformational feature within [5-8]LHRH (Fig. (45))
[272,273].

Fig. (45). β turn-derived design strategy uncovered highly active non-peptide LHRH analogues.

The β turn is considered to represent the bioactive
conformation of LHRH in the receptor-bound state.
Therefore, an attempt was made to transfer the structural
element of a β turn onto a rigid scaffold which mimics the β turn
and can be decorated with the crucial functionalities, thus
positioning them into the receptor-complementary
orientation (Fig. (45)). For this purpose, a directed screening
approach was initiated, aimed at uncovering compounds
showing similarity to the turn template. The screening
for an inhibitory effect on the specific binding of
[125I]leuprorelin to human LHRH receptor [275] expressed in
CHO cells resulted in the initial lead compound 140 (Fig.
(45)) [274].

This compound was structurally compared to the
hypothesized β turn arrangement and changed in order to
fulfil the structural requirement imposed by that template,
e.g. substituting Gly by hydrophobic D-amino acids
increased activity, presumably due to stabilization of the β
turn by introducing a D-amino acid into the i+1 position of
the β turn. Subsequent modifications finally led to the
discovery of T-98475 141 (Fig. (45)), exhibiting an IC50
value of 0.2 nM for the binding to the cloned human LHRH
receptor. Further, T-98475 141 shows inhibitory effects on
LHRH-stimulated LH release in functional in vitro and in
vivo assays. Thus, T-98475 141 is a good candidate for a
new class of therapeutics for the treatment of LH-induced
dysfunctions in sex-hormone-dependent pathologies.



C5a 

The 74 amino acid peptide C5a (Table 1) is released after
activation of the complement system at sites of inflammation
by proteolytic cleavage of the complement factor C5 [276].
The hormone-like peptide anaphylatoxin C5a acts as
chemotaxin by attracting and promoting the degranulation of
granulocytes and macrophages during immune response
[277,278]. Inappropriate activation of C5a results in a
number of inflammatory diseases including rheumatoid
arthritis [279], Alzheimer's disease [280], ischemic heart
failure [281], psoriasis [282], atherosclerosis [283], and
adult respiratory distress syndrome (ARDS) [284]. In this
sense, agents preventing the interaction of C5a and its
receptor, C5aR, would be useful for inhibition of the pro-
inflammatory function of C5a and thus be useful
therapeutics in the treatment of chronic inflammatory
disorders induced by activation of the complement system
and the release of C5a [285,286]. The binding of the small
protein C5a to its receptor is characterized by two interaction
sites. A two-site model has been proposed, localizing the
major binding epitope for the ligand C5a in the extracellular
N-terminal region of the receptor, while the second binding
cavity is located in the core of the transmembrane helix
bundle, obviously serving as the "activating binding site"
recognizing the C-terminal octapeptide of the ligand
[287,288]. Starting from the sequence of the native ligand, a
number of peptide-based antagonists were discovered which
have been reviewed only recently by Wong et al. [289].
Obviously, the development of a nonpeptide antagonist in
this field is a major challenge, since over the last two decades
research revealed only low molecular weight compounds
acting as C5a agonists or at least partial agonists.

Fig. (46). C5a analogues.

Merck identified an initial lead 142 (Fig. (46)), which
served for further optimization, by screening an in-house
sample collection for the displacement of [125I]C5a from
human neutrophil membrane preparations [290].

The spiroindane-bearing hydantoin 142 has been
modified by introduction of a cyclohexylmethyl group
instead of the benzyl residue, resulting in compound 143
(Fig. (46)), which exhibits an IC50 value of 0.3 pM.

Surprisingly, functional receptor assays revealed that all
compounds of this series with affinity for C5aR showed an
agonistic potential. The only nonpeptide antagonists have
been reported by Merck, investigating 4,6-diaminoquinolines
(144) [291], and Rhone-Poulenc Rorer, identifying a
phenylguanidine by random screening (145, IC50 = 0.8 µM)
(Fig. (46)) [292].

As random screening techniques have not brought the
expected success, rational design would offer an alternative in
the lead finding process for C5a antagonists. Based on the
results of conformational studies of cyclic pentapeptide ET
antagonists, BE-18257B and BQ-123 [293,294], Wong and
co-workers [295,296] followed the same strategy as presented
by Porcelli et al. [266] for the design of the SP antagonist,
ITF-1565. BQ-123, cyclo-(D-Val1-Leu2-D-Trp3-D-Asp4-
Pro5-), and ITF-1565, cyclo-(D-Trp1-Pro2-D-Lys3-D-Trp4-
Phe5-), follow an identical chirality pattern of D1L2D3D4L5,
leading to a βII/γ(i) turn arrangement with L2-D3 in the i+1 and
i+2 positions of the β turn and L5 in the central position of
an (inverse) γ turn. The strategy seems also to be applicable
to C5a, since the C-terminal-derived C5a antagonist NMe-
Phe-Lys-Pro-D-Cha-Trp-D-Arg (Cha: cyclohexylalanine)
shows a well defined structure in solution in which the
lysine sidechain is in close proximity to the D-arginine
carboxylate. Ring closure resulted in a backbone-to-sidechain
cyclized peptide, cyclo-Ac-Phe-(Orn-Pro-D-Cha-Trp-D-Arg-)
(brackets indicate the sidechain-to-backbone mode of
cyclization, Orn-NHε-CO-D-Arg), with an IC50 value of 9.28
µM for the displacement of [125I]C5a from human
polymorphonuclear (PMN) cells. Conformational analysis
revealed a γ turn with Pro in the central position, stabilized
by a hydrogen bond between the flanking amino acids, Orn-
CO···HN-D-Cha, together with a "pseudo" βII turn involving
D-Cha-Trp-D-Arg-Orn, defined by a second hydrogen bond
between D-Cha-CO···HεN-Orn. This is consistent with the
φi+1/ψi+1 and φi+2/ψi+2 dihedrals of Trp and D-Arg (-
58°/90°; 69°/-3°), confirming a β turn type II (ideal values:
-60°/120°; 80°/0°) arrangement [295]. More detailed SAR
studies showed that the L-Arg containing isomer is much
more active than the D-Arg congener (IC50 = 20 nM;
inhibition of C5a-induced release of myeloperoxidase from
PMNs). The NMR-derived solution structure reveals an
inverse γ turn (γi) involving D-Cha-Trp-Arg, stabilized by a
hydrogen bond between D-Cha-CO···HN-Arg [296].
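
The comparison of observed dihedrals with ideal turn values can be sketched in a few lines of Python. This is only an illustration, not the procedure used in [295]: the type II ideal values are those quoted above, whereas the type I values (-60°/-30°; -90°/0°) are commonly cited textbook values that do not appear in this review, and the 45° tolerance is an arbitrary assumption.

```python
# Ideal (phi, psi) dihedrals in degrees for residues i+1 and i+2.
# Type II values are quoted in the text; type I values are an
# assumption (commonly cited textbook values, not from this review).
IDEAL_TURNS = {
    "I": ((-60.0, -30.0), (-90.0, 0.0)),
    "II": ((-60.0, 120.0), (80.0, 0.0)),
}


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def max_deviation(observed, ideal):
    """Largest deviation over the four dihedrals of the turn."""
    return max(
        angle_diff(o, t)
        for obs_pair, ideal_pair in zip(observed, ideal)
        for o, t in zip(obs_pair, ideal_pair)
    )


def classify_turn(observed, tolerance=45.0):
    """Return (closest turn type or None, its maximum deviation).

    The tolerance is a hypothetical cut-off chosen for illustration.
    """
    best = min(IDEAL_TURNS, key=lambda k: max_deviation(observed, IDEAL_TURNS[k]))
    dev = max_deviation(observed, IDEAL_TURNS[best])
    return (best if dev <= tolerance else None), dev


# Dihedrals of Trp (i+1) and D-Arg (i+2) as given in the text.
print(classify_turn(((-58.0, 90.0), (69.0, -3.0))))
```

With the values reported for Trp and D-Arg, the largest deviation from the ideal type II turn is the ψ angle of Trp (90° against 120°).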



CONCLUSION 

This review was intended to highlight not only the 
relevance of the GPCR superfamily for drug development 
purposes during the last decade, but also the tremendous 
potential of that particular target class for future medicinal
chemistry programs aimed to uncover new ligands for 
peptide-binding GPCRs. Especially the cross-fertilizing 
combination of ligand-derived structure rationales with the 
dramatically enhanced efficiency of automated synthesis and 
combinatorial chemistry will enable pharmaceutical research 
to identify new chemical entities more rapidly. Even though 
we have witnessed a technology-based quantum leap forward 
in efficiency within medicinal chemistry in the late 1990's, 
the vigorous search for novel GPCR genes within e.g. the 
human genome has far outpaced the identification of novel 
endogenous and exogenous ligands. The identification of 
these ligands remains one of the most challenging tasks in 
modern pharmacology. The number of GPCRs for which 
endogenous or exogenous ligands are unknown today 
continues to increase, thus offering modern pharmaceutical 
research new opportunities in that entirely new drug targets 
associated with innovative therapeutic principles emerge. In 
this context, new low-molecular weight ligands for these 
orphan receptors will undoubtedly lead to novel insights 
into the complexity of numerous poorly understood human 
disorders. Consequently, targeted medicinal chemistry 
approaches towards members of the GPCR family will 
facilitate the understanding of the precise physiological role 
of orphan receptors as well as produce new compounds as 
qualified lead structures for clinical development. 

In conclusion, the field of GPCR research is clearly
expected to grow dramatically due to the progress that will
be made in the human genome initiative, demanding
increased contributions from medicinal chemistry in order to
provide new pharmacological tools as well as new leads for
the development of new drugs.

THE HUMAN GENOME

[...] the human genome was generated by the [...] months from
27,271,853 high-quality sequence reads [...] from both ends of
plasmid clones made from the DNA [...] individuals. Two
assembly strategies, a [...] and a [...] chromosome
assembly, were used, [...] from Celera and the
publicly funded genome effort. The public data were shredded
into [...] segments to create a 2.9-fold coverage of [...]
regions that had been sequenced, without including [...]
cloning and assembly procedure used by the publicly funded
[...]. [...] the effective coverage in the assemblies to
eightfold, reducing the number and size of gaps in the final
assembly over [...] coverage. The two assembly strategies
[...] that largely agree with independent mapping data. The
assemblies effectively cover the euchromatic regions of the
human chromosomes. More than [...] is in scaffold assemblies
of [...], and [...] of the genome is in scaffolds of 10
million bp or larger. Analysis of the genome sequence
revealed [...] corroborating evidence and an additional
~12,000 [...] matches or other weak supporting evidence.
[...] gene-dense clusters are obvious, almost half the genes
[...] sequence separated by large tracts of apparently [...].
[...] of the genome is spanned by exons, whereas 24% is in
introns, with [...] of the genome being intergenic DNA.
Duplications of segmental blocks, ranging [...] chromosomal
lengths, are abundant throughout [...] evolutionary history.
Comparative genomic analysis indicates vertebrate expansions
of genes associated with [...] developmental regulation, and
with the [...] immune systems. DNA sequence comparisons
between the consensus sequence and publicly funded genome
data provided locations of 2.1 million single-nucleotide
polymorphisms (SNPs). A random pair of human haploid genomes
differed at a rate of 1 bp per 1250 on average, but there
was [...] in the level of polymorphism [...] variation in
[...] remains an open challenge.



Decoding of the DNA that constitutes the
human genome has been widely anticipated
for the contribution it will make toward
understanding human evolution, the causation
of disease, and the interplay between the
environment and heredity in defining the
human condition. A project with the goal of
determining the complete nucleotide
sequence of the human genome was first
formally proposed in 1985 (1). In subsequent
years, the idea met with mixed reactions in
the scientific community (2). However, in
1990, the Human Genome Project (HGP) was
officially initiated in the United States under
the direction of the National Institutes of
Health and the U.S. Department of Energy
with a 15-year, $3 billion plan for completing
the genome sequence. In 1998 we announced
our intention to build a unique genome-
sequencing facility, to determine the
sequence of the human genome over a 3-year
period. Here we report the penultimate
milestone along the path toward that goal, a nearly
complete sequence of the euchromatic
portion of the human genome. The sequencing
was performed by a whole-genome random
shotgun method with subsequent assembly of
the sequenced segments.

The modern history of DNA sequencing
began in 1977, when Sanger reported his
method for determining the order of nucleotides of
DNA using chain-terminating nucleotide
analogs (3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood
and co-workers (5) described an improvement
in the Sanger sequencing method that included
attaching fluorescent dyes to the nucleotides,
which permitted them to be sequentially read
by a computer. The first automated DNA
sequencer, developed by Applied Biosystems in
California in 1987, was shown to be successful
when the sequences of two genes were obtained
with this new technology (6). From early
sequencing of human genomic regions (7), it
became clear that cDNA sequences (which are
reverse-transcribed from RNA) would be
essential to annotate and validate gene predictions
in the human genome. These studies were the
basis in part for the development of the
expressed sequence tag (EST) method of gene
identification (8), which is a random selection,
very high throughput sequencing approach to
characterize cDNA libraries. The EST method
led to the rapid discovery and mapping of
human genes (9). The increasing numbers of
human EST sequences necessitated the
development of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at
The Institute for Genomic Research (TIGR), an
algorithm was developed that permitted
assembly and analysis of hundreds of thousands of
ESTs. This algorithm permitted
characterization and annotation of human genes on the basis
of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda
genome sequence was determined by a
shotgun restriction digest method in 1982
(11). When considering methods for
sequencing the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method
was discussed and subsequently rejected
owing to the lack of appropriate software tools
for genome assembly. However, in 1994,
when a microbial genome-sequencing project
was contemplated at TIGR, a whole-genome
shotgun sequencing approach was considered
possible with the TIGR EST assembly
algorithm. In 1995, the 1.8-Mbp Haemophilus
influenzae genome was completed by a
whole-genome shotgun sequencing method
(13). The experience with several subsequent
genome-sequencing efforts established the
broad applicability of this approach (14, 15).

A key feature of the sequencing approach
used for these megabase-size and larger
genomes was the use of paired-end sequences
(also called mate pairs), derived from
subclone libraries with distinct insert sizes and
cloning characteristics. Paired-end sequences
are sequences 500 to 600 bp in length from
both ends of double-stranded DNA clones of
prescribed length. The success [...]
sequences from long segments (18 to 20 kbp)
of DNA cloned into bacteriophage lambda in
[...] suggestion (16) of an approach to
simultaneously map and sequence the human
genome by means of end sequences from 150-
kbp bacterial artificial chromosomes (BACs)
(17, 18). The end sequences spanned by
known distances provide long-range
continuity across the genome. A modification of the
BAC end-sequencing (BES) method was
applied successfully to complete chromosome 2
from the Arabidopsis thaliana genome (19).
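
The long-range constraint that mate pairs provide can be illustrated with a small consistency check: two end reads from the same clone should face each other and lie roughly one insert length apart. The sketch below is only an illustration of that idea and not the assembler described in this paper; the `MatePair` type, the three-standard-deviation cut-off, and the example coordinates are hypothetical, while the 1,951-bp mean and 6.10% SD are the 2-kbp library values listed in Table 1.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Placement of two end reads of one clone on a scaffold (hypothetical type)."""
    left_pos: int        # scaffold coordinate where the left read starts (bp)
    right_pos: int       # scaffold coordinate where the right read ends (bp)
    left_forward: bool   # True if the left read points toward increasing coordinates
    right_forward: bool  # True if the right read points toward increasing coordinates


def is_consistent(pair, mean_insert, sd_insert, n_sd=3.0):
    """True if the reads face each other and span about one insert length.

    n_sd is an assumed tolerance, not a value taken from the paper.
    """
    if not (pair.left_forward and not pair.right_forward):
        return False
    span = pair.right_pos - pair.left_pos
    return abs(span - mean_insert) <= n_sd * sd_insert


# 2-kbp library from Table 1: mean 1,951 bp, SD 6.10% of the mean.
MEAN_2K = 1951.0
SD_2K = 0.061 * MEAN_2K

good = MatePair(left_pos=10_000, right_pos=12_000, left_forward=True, right_forward=False)
stretched = MatePair(left_pos=10_000, right_pos=13_000, left_forward=True, right_forward=False)
print(is_consistent(good, MEAN_2K, SD_2K), is_consistent(stretched, MEAN_2K, SD_2K))
```

Pairs that fail such a check point to a misjoin or a misplaced read, which is why libraries with tightly controlled insert sizes are valuable for ordering and orienting contigs.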

In 1997, Weber and Myers (20) proposed
whole-genome shotgun sequencing of the
human genome. Their proposal was not well
received (21). However, by early 1998, as
less than 5% of the genome had been
sequenced, it was clear that the rate of progress
in human genome sequencing worldwide
was very slow (22), and the prospects for
finishing the genome by the 2005 goal were
uncertain.

In early 1998, PE Biosystems (now Applied
Biosystems) developed an automated, high-
throughput capillary DNA sequencer,
subsequently called the ABI PRISM 3700 DNA
Analyzer. Discussions between PE Biosystems
and TIGR scientists resulted in a plan to
undertake the sequencing of the human genome with
the 3700 DNA Analyzer and the whole-genome
shotgun sequencing techniques developed at
TIGR (23). Many of the principles of operation
of a genome-sequencing facility were estab- 
lished in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic
changes in the public genome effort subsequent
to the formation of Celera (29), led to a
modified whole-genome shotgun sequencing
approach to the human genome. We initially
proposed to do 10-fold sequence coverage of the
genome over a 3-year period and to make
interim assembled sequence data available
quarterly. The modifications included a plan to
perform random shotgun sequencing to ~5-fold
coverage and to use the unordered and
unoriented BAC sequence fragments and
subassemblies published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly
announcements in the absence of interim
assemblies to report.

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of
chromosomes of the Homo sapiens genome. Any
GenBank-derived data were shredded to remove
potential bias to the final sequence from
chimeric clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with
this issue; files for each chromosome can be
found in Web fig. 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview
of the genome and the features encoded
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 




1 Sources of DNA and Sequencing
Methods

Summary. This section discusses the rationale
and ethical rules governing donor selection to
ensure ethnic and gender diversity along with
the methodologies for DNA extraction and
library construction. The plasmid library
construction is the first critical step in shotgun
sequencing. If the DNA libraries are not
uniform in size, nonchimeric, and do not randomly
represent the genome, then the subsequent steps
cannot accurately reconstruct the genome
sequence. We used automated high-throughput
DNA sequencing and the computational
infrastructure to enable efficient tracking of
enormous amounts of sequence information (27.3
million sequence reads; 14.9 billion bp of
sequence). Sequencing and tracking from both
ends of plasmid clones from 2-, 10-, and 50-kbp
libraries were essential to the computational
reconstruction of the genome. Our evidence
indicates that the accurate pairing rate of end
sequences was greater than 98%.
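
The fold sequence coverage implied by these figures follows from simple arithmetic: total sequenced bases divided by genome size. The short Python sketch below is an illustration only; the function name is ours, and it combines the read count given in the abstract (27,271,853), the average trimmed read length reported in section 1.2 (543 bp), and the 2.9-Gb genome size used in Table 1.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Read count from the abstract, mean trimmed length from section 1.2,
# genome size from the header of Table 1.
coverage = fold_coverage(27_271_853, 543, 2.9e9)
print(f"{coverage:.2f}-fold")
```

The result agrees with the total fold sequence coverage listed in Table 1.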



Various policies of the United States and the 
World Medical Association, specifically the , 
Declaration of Helsinki, offer recommenda- 
tions for conducting experiments with human 
subjects. We convened an Institutional Re- 
view Board (IRB) (31) that helped us estab- 
lish the protocol for obtaining and using hu- 
man DNA and the informed consent process 
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. Prospective
donors were asked, on a voluntary
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from
each donor were recorded and linked by
confidential code to the donated sample: age,
sex, and self-designated ethnogeographic
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by
Epstein-Barr virus immortalization. DNA
from five subjects was selected for genomic
DNA sequencing: two males and three
females: one African-American, one Asian-
Chinese, one Hispanic-Mexican, and two
Caucasians (see Web fig. 2 on Science Online
at www.sciencemag.org/cgi/content/291/5507/1304/DC1).
The decision of whose DNA to
sequence was based on a complex mix of
factors, including the goal of achieving diversity as
well as technical issues such as the quality of
the DNA libraries and availability of
immortalized cell lines.

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun
sequencing process is preparation of high-quality
plasmid libraries in a variety of insert sizes so that
pairs of sequence reads (mates) are obtained,
one read from each end of each plasmid insert.
High-quality libraries have an equal
representation of all parts of the genome, a small number
of clones without inserts, and no contamination
from such sources as the mitochondrial genome
and Escherichia coli genomic DNA. DNA from
each donor was used to construct plasmid
libraries in one or more of three size classes: 2 kbp, 10
kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing
process, we focused on developing a simple
system that could be implemented in a robust
and reproducible manner and monitored
effectively (Fig. 2) (34).

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence
per reaction. This limitation on read length has
made monumental gains in throughput a
prerequisite for the analysis of large eukaryotic
genomes. We accomplished this at the Celera
facility, which occupies about 30,000 square
feet of laboratory space and produces sequence
data continuously at a rate of 175,000 total
reads per day. The DNA-sequencing facility is
supported by a high-performance
computational facility (36).

The process for DNA sequencing was
modular by design and automated. Intermodule
sample backlogs allowed four principal
modules to operate independently: (i)
library transformation, plating, and colony
picking; (ii) DNA template preparation;
(iii) dideoxy sequencing reaction set-up
and purification; and (iv) sequence
determination with the ABI PRISM 3700 DNA
Analyzer. Because the inputs and outputs
of each module have been carefully
matched and sample backlogs are
continuously managed, sequencing has proceeded
without a single day's interruption since the
initiation of the Drosophila project in May
1999. The ABI 3700 is a fully automated
capillary array sequencer and as such can
be operated with a minimal amount of
hands-on time, currently estimated at about
15 min per day. The capillary system also
facilitates correct associations of
sequencing traces with samples through the
elimination of manual sample loading and lane-
tracking errors associated with slab gels.
About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management
system (LIMS) tracked all sample plates by
unique bar code identifiers. The facility was
supported by a quality control team that
performed raw material and in-process testing
and a quality assurance group with
responsibilities including document control,
validation, and auditing of the facility. Critical to
the success of the scale-up was the validation
of all software and instrumentation before
implementation, and production-scale testing
of any process changes.

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the 
average trimmed sequence length was 543 
bp, and the sequencing accuracy was expo- 
nentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed sequence
was screened for matches to contaminants
including sequences of vector alone, E.
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 

The importance of the base-pair level accuracy
of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-pair
information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as sequencing
reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data produced
by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1. Celera-generated data input into assembly.

                        Number of reads for different insert libraries  Total number of
Individual                     2 kbp      10 kbp      50 kbp       Total      base pairs

No. of sequencing reads
  A                                0           0   2,767,357   2,767,357   1,502,674,851
  B                       11,736,757   7,467,755      66,930  19,271,442  10,464,393,006
  C                          853,819     881,290           0   1,735,109     942,164,187
  D                          952,523   1,046,815           0   1,999,338   1,085,640,534
  F                                0   1,498,607           0   1,498,607     813,743,601
  Total                   13,543,099  10,894,467   2,834,287  27,271,853  14,808,616,179

Fold sequence coverage (2.9-Gb genome)
  A                                0           0        0.52        0.52
  B                             2.20        1.40        0.01        3.61
  C                             0.16        0.17           0        0.32
  D                             0.18        0.20           0        0.37
  F                                0        0.28           0        0.28
  Total                         2.54        2.04        0.53        5.11

Fold clone coverage
  A                                0           0       18.39       18.39
  B                             2.96       11.26        0.44       14.67
  C                             0.22        1.33           0        1.54
  D                             0.24        1.58           0        1.82
  F                                0        2.26           0        2.26
  Total                         3.42       16.43       18.84       38.68

Average
  Insert size* (mean)       1,951 bp   10,800 bp   50,715 bp
  Insert size* (SD)            6.10%       8.10%      14.90%
  % Mates†                     74.50       80.80       75.60

*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing runs.

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the genome.
One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to computational
assembly. Both approaches provided essentially
the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the genome
was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received [...]
process has defined inputs and outputs with the capability to exchange [...]
quality control measures, and responsible parties are indicated and are
described further in the text.



Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera assemblies
consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds.
This set of unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
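
The constraint that a mate pair places on two contigs can be made concrete with a small sketch. The code below is an illustration only, not the assembler's implementation: the function name, the coordinate convention (contig A lies to the left of contig B, and both contigs are already oriented consistently with the pair), and the example coordinates are assumptions. The library mean used in the example is the 10-kbp value reported in Table 1 (10,800 bp).

```python
def implied_gap(insert_mean, insert_sd, a_len, a_read_start, b_read_end):
    """Estimate the gap between contig A (left) and contig B (right).

    a_read_start: 0-based start of the forward read on contig A.
    b_read_end:   end coordinate (exclusive) of the mate on contig B,
                  i.e. how far into B the clone insert reaches.
    Returns (gap_mean, gap_sd). A negative mean suggests the contigs overlap.
    All parameter names are hypothetical and chosen for this sketch.
    """
    bases_in_a = a_len - a_read_start   # part of the insert lying on contig A
    bases_in_b = b_read_end             # part of the insert lying on contig B
    gap_mean = insert_mean - bases_in_a - bases_in_b
    # With a single mate pair the uncertainty is that of the library itself.
    return gap_mean, insert_sd


if __name__ == "__main__":
    # Hypothetical placement: 4,000 bp of the insert on A, 5,000 bp on B.
    print(implied_gap(10_800, 875, a_len=30_000, a_read_start=26_000, b_read_end=5_000))
```

With these assumed coordinates the estimate is 10,800 - 4,000 - 5,000 = 1,800 bp. Several mate pairs spanning the same gap would normally be combined to tighten the estimate; that step is left out here.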
2.1 Assembly data sets

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10 and 50 kbp were used. By looking at how 
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we
were able to characterize the range of insert
sizes in each library and determine a mean and
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone cov- 
erage achieved by the data set. The clone cov- 
erage is the coverage of the genome in cloned 
DNA, considering the entire insert of each 
clone that has sequence from both ends. The 
clone coverage provides a measure of the 
amount of physical DNA coverage of the ge- 
nome. Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 5.1X coverage
of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of 
38.7X clone coverage. 
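
The two coverage measures defined above can be checked with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the clone-coverage formula (number of mated clones times mean insert size, divided by genome size, with the number of mated clones taken as half the mated reads) is one plausible reading of the definition in the text, so it is expected to land near, not exactly on, the tabulated value.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def fold_sequence_coverage(n_reads, mean_read_len, genome_size=GENOME_SIZE_BP):
    """Total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


def fold_clone_coverage(n_reads, mate_fraction, mean_insert, genome_size=GENOME_SIZE_BP):
    """Physical coverage by clones sequenced from both ends (assumed formula)."""
    n_mated_clones = n_reads * mate_fraction / 2  # two reads per mated clone
    return n_mated_clones * mean_insert / genome_size


# Values from the text and Table 1: 27,271,853 reads of 543 bp on average;
# 2-kbp libraries: 13,543,099 reads, 74.5% mates, 1,951-bp mean insert.
seq_cov = fold_sequence_coverage(27_271_853, 543)
clone_cov_2k = fold_clone_coverage(13_543_099, 0.745, 1_951)
print(f"sequence coverage: {seq_cov:.2f}X")
print(f"2-kbp clone coverage (approximate): {clone_cov_2k:.2f}X")
```

The sequence coverage reproduces the 5.1X figure. The clone-coverage estimate for the 2-kbp libraries comes out slightly below the 3.42X in Table 1, which suggests the tabulated value was computed from per-library counts rather than from the pooled averages used here.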

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence.
The data for each BAC is deposited at one of
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered assemblies
of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has
focused on a product of lower quality and com- 
pleteness, but on a faster time-course, by con- 
centrating on the production of Phase 1 data 
from a 3X to 4X light-shotgun of each BAC 
clone. 

We screened the bactig sequences for contaminants
by using the BLAST algorithm
against three data sets: (i) vector sequences
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank
without primate and human virus entries,
filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within
50 bp of the end of a contig, the tip up to
the matching vector was excised. Under
these criteria we removed 2.6 Mbp of possible
contaminant and vector from the
Phase 3 data, 61.0 Mbp from the Phase 1
and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2),
and 5% single sequencing reads (Phase 0).
An additional 104,018 BAC end-sequence
mate pairs were also downloaded and included
in the data sets for both assembly
processes (18).
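
The tip-excision rule stated above (25 bp or more of vector within 50 bp of a contig end) can be sketched as follows. This is a minimal illustration, not the screening code used for the assemblies: the function name and the half-open interval representation of BLAST hits are assumptions, and the sketch assumes that the matched vector itself is removed together with the tip.

```python
def trim_vector_tips(contig_len, vector_hits, min_hit=25, window=50):
    """Return the (start, end) interval of the contig kept after tip excision.

    vector_hits: list of (hit_start, hit_end) half-open intervals on the contig.
    A hit counts only if it is at least `min_hit` bp long and lies within
    `window` bp of a contig end; internal hits are left for other filters.
    """
    start, end = 0, contig_len
    for hit_start, hit_end in vector_hits:
        if hit_end - hit_start < min_hit:
            continue                          # too short to count as vector
        if hit_start < window:                # hit near the left end
            start = max(start, hit_end)       # drop tip and vector
        elif hit_end > contig_len - window:   # hit near the right end
            end = min(end, hit_start)
    return start, max(start, end)


if __name__ == "__main__":
    print(trim_vector_tips(1000, [(10, 40), (960, 990)]))
```

In the example, a 30-bp vector hit starting 10 bp from the left end and another ending 10 bp from the right end leave the interval from 40 to 960.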

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP
data were first disassembled or "shredded" into a
synthetic shotgun data set of 550-bp reads that
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were
sufficient to cover the genome 2.96X because
of redundancy in the BAC data set, without
incorporating the biases inherent in the PFP
assembly process. The combined data set of
43.32 million reads (8X), and all associated
mate-pair information, were then subjected to
our whole-genome assembly algorithm to produce
a reconstruction of the genome. Neither
the location of a BAC in the genome nor its
assembly of bactigs was used in this process.
Bactigs were shredded into reads because we
found strong evidence that 2.13% of them were
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 22% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors

Table 2. GenBank data input into assembly.
[Table 2 body is not legible in the available copy. Row labels that
survive: number of accession records, number of contigs, total base
pairs, total vector masked (bp), total contaminant masked (bp), and
average contig length (bp). The surviving footnote text states that the
bases contributed by all centers were shredded into faux reads
resulting in 2.96X coverage of the genome.]


(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional sequence
coverage, but not mate pairs, assembled
bactigs, or genome locality, from some externally
generated data.
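
The shredding step described above can be sketched in a few lines. The paper does not say how the 2X covering was laid out, so the scheme below is an assumption: two tilings of the bactig with 550-bp chunks, the second shifted by half a read, with shorter reads at the ends so that every base is covered exactly twice. The function names are hypothetical.

```python
def shred(bactig, read_len=550):
    """Cut a bactig into faux reads so that every base is covered exactly twice.

    Returns a list of (offset, sequence) pairs. Reads at the ends of the
    bactig may be shorter than read_len (an assumption of this sketch).
    """
    half = read_len // 2
    reads = []
    # Tiling 1: consecutive chunks starting at offset 0.
    for start in range(0, len(bactig), read_len):
        reads.append((start, bactig[start:start + read_len]))
    # Tiling 2: the same chunks shifted by half a read; it opens with a half read.
    if bactig:
        reads.append((0, bactig[:half]))
    for start in range(half, len(bactig), read_len):
        reads.append((start, bactig[start:start + read_len]))
    return reads


def depth(bactig_len, reads):
    """Per-base coverage implied by a list of (offset, sequence) reads."""
    cover = [0] * bactig_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            cover[i] += 1
    return cover


if __name__ == "__main__":
    bactig = "ACGT" * 500                      # 2,000-bp stand-in sequence
    faux = shred(bactig)
    print(len(faux), set(depth(len(bactig), faux)))
```

Because each tiling covers every base once, the union is a uniform 2X covering regardless of bactig length. No mate-pair or order information is attached to the faux reads, which matches the statement that only sequence coverage was taken from the external data.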

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments
or "components" that could be determined with
confidence, and then shotgun assembly was applied
to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was reduced
and the effect of interchromosomal duplications
was ameliorated. This also resulted in a
reconstruction of the genome that was relatively
independent of the whole-genome assembly results
so that the two assemblies could be compared
for consistency. The quality of the partitioning
into components was crucial so that
different genome regions were not mixed together.
We constructed components from (i) the
longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate
and complete the scaffold for a given sequence
stretch, the more accurately one can tile these
scaffolds into contiguous components on the
basis of sequence overlap and mate-pair information.
We further visually inspected and curated
the scaffold tiling of the components to
further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of
the sequence in each component was obtained
by applying our whole-genome assembly algorithm
to the partitioned, relevant Celera data
and the shredded, faux reads of the partitioned,
relevant bactig data.



2.3 Whole-genome assembly

The algorithms used for whole-genome assembly
(WGA) of the human genome were
enhancements to those used to produce the
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out [...], including
Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.
The Overlapper compares every read
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously vector-trimmed,
the Overlapper can insist on complete
overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
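
The overlap criterion above (at least 40 bp, no more than 6% differences, extending to the ends of both reads) can be illustrated with a naive check. This is a sketch for exposition only and is not the Overlapper: it compares one pair of reads by brute force, counts substitutions but not insertions or deletions, and ignores reverse-complement overlaps. The function name and the seeded random test sequence are assumptions.

```python
import random


def best_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of `a` matching a prefix of `b`.

    A candidate overlap qualifies if it is at least `min_len` bp long and
    has at most `max_diff` * length mismatches. Returns 0 if none qualifies.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(300))
    read_a, read_b = genome[0:150], genome[90:240]   # true overlap of 60 bp
    print(best_overlap(read_a, read_b))
```

Comparing every read against every other in this way would be quadratic in the number of reads, which is why the real computation needed indexing and the parallel hardware described in the text.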

Every overlap computed above is statistically
a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially
difficult is that while many overlaps
are actually sampled from overlapping
regions of the genome, and thus imply that
the sequence reads should be assembled together,
even more overlaps are actually from
two distinct copies of a low-copy repeated
element not screened above, thus constituting
an error if put together. We call the former
"true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choosing
repeat-induced overlaps, especially early
in the process.

We achieve this objective in the Unitig- 
ger. We first find all assemblies of reads that 
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely as- 
sembled contigs). Formally, these unitigs are 
the uncontested interval subgraphs of the
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6X simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all 
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
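
The text does not give the formula for the discriminator. One standard way to build a log-odds score of this kind, assumed here purely for illustration, is to model read start positions as a Poisson process: if a unitig of length L is unique, the expected number of reads starting in it is lambda = L * N / G (N reads, genome size G); if it is a two-copy repeat collapsed into one unitig, the expectation is 2 * lambda. The natural-log ratio of the two Poisson probabilities for an observed count k simplifies to lambda - k * ln 2. The function name and the example unitig are hypothetical.

```python
import math


def unique_vs_repeat_log_odds(unitig_len, n_reads_in_unitig, total_reads, genome_size):
    """ln[P(k reads | unique) / P(k reads | two-copy repeat)] under a Poisson model.

    Positive values favor unique DNA; negative values favor an
    overcollapsed repeat. This is an assumed form, not the paper's formula.
    """
    expected_if_unique = unitig_len * total_reads / genome_size
    return expected_if_unique - n_reads_in_unitig * math.log(2)


if __name__ == "__main__":
    # Hypothetical 10-kbp unitig, with read count and genome size from the text.
    for k in (94, 188):
        score = unique_vs_repeat_log_odds(10_000, k, 27_271_853, 2.9e9)
        print(k, round(score, 1))
```

In this example about 94 reads are expected for unique sequence, so a unitig holding 94 reads scores well above zero while one holding 188 reads scores well below it. Two thresholds on such a score would give the stringent and less stringent classes described above.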

The result of running the Unitigger was
thus a set of correctly assembled subcontigs
covering an estimated 73.6% of the human
genome. The Scaffolder then proceeded to
use mate-pair information to link these together
into scaffolds. When there are two or
more mate pairs that imply that a given pair
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one
can with high confidence link together all
U-unitigs that are linked by at least two 2- or
10-kbp mate pairs producing intermediate-sized
scaffolds that are then recursively
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process
yielded scaffolds that are on the order of
megabase pairs in size with gaps between
their contigs that generally correspond to repetitive
elements and occasionally to small
sequencing gaps. These scaffolds reconstruct
the majority of the unique sequence within a
genome.

For the Drosophila assembly, we engaged
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we continued
to use the first "Rocks" substage where
all unitigs with a good, but not definitive,
discriminator score are placed in a scaffold
gap. This was done with the condition that
two or more mate pairs with one of their
reads already in the scaffold unambiguously
place the unitig in the given gap. We estimate
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage 
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed 
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the 
gap, eliminating any reads that conflict with
the assembly. This operation proved much 
more reliable than the one it replaced for the 
Drosophila assembly; in the assembly of a 
simulated shotgun data set of human chromosome
22, all stones were placed correctly.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover 
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long interspersed
elements whose quality was only
99.62% correct. We decided that for the hu- 
man genome it was philosophically better not 
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process,
and also at several intermediate points, a 
consensus sequence of every contig is pro- 
duced. Our algorithm is driven by the princi- 
ple of maximum parsimony, with quality- 
value-weighted measures for evaluating each 
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
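
A much simplified version of a quality-weighted consensus call is sketched below. It is not the algorithm used for the assemblies: summing quality values per candidate base is used here as a stand-in for the Bayesian estimate mentioned in the text, and the tuple layout, the source labels, and the function name are assumptions. The one rule taken directly from the text is that Celera data are used whenever present, with BAC data as the fallback.

```python
from collections import defaultdict


def consensus_base(column):
    """Call one consensus base from an alignment column.

    column: list of (base, quality, source) tuples, where source is
    "celera" or "bac" (hypothetical labels). Returns "N" for an empty column.
    """
    celera = [entry for entry in column if entry[2] == "celera"]
    votes = defaultdict(int)
    # Prefer Celera reads; fall back to BAC-derived data only if none cover the column.
    for base, quality, _source in (celera or column):
        votes[base] += quality
    if not votes:
        return "N"
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(votes), key=votes.get)


if __name__ == "__main__":
    print(consensus_base([("A", 40, "celera"), ("C", 10, "celera"), ("C", 15, "celera")]))
```

In the example a single high-quality A (40) outweighs two low-quality C calls (10 + 15), which is the kind of outcome a simple majority vote would get wrong.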

A key element of achieving a WGA of the 
human genome was to parallelize the Overlap- 
per and the central consensus sequence-con- 
structing subroutines. In addition, memory was 
a real issue — a straightforward application of 
the software we had built for Drosophila would 
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger 
incremental, we were able to achieve the same 
computation with a maximum of instantaneous 
usage of 28 gigabytes of RAM. Moreover, the 
incremental nature of the first three stages al- 
lowed us to continually update the state of this 
part of the computation as data were delivered 
and then perform a 7-day run to complete Scaf- 
folding and Repeat Resolution whenever de- 
sired. For our assembly operations, the total 
compute infrastructure consists of 10 four-pro- 
cessor SMPs with 4 gigabytes of memory per 
cluster (Compaq's ES40, Regatta) and a 16- 
processor NUMA machine with 64 gigabytes 
of memory (Compaq's GS160, Wildfire). The 
total compute for a run of the assembler was 
roughly 20,000 CPU hours. 

The assembly of Celera's data, together
with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and consisting
of 2.586 Gbp of sequence. The chaff, or
set of reads not incorporated in the assembly,
numbered 11.27 million (26%), which is consistent
with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the distribution
of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long. Similarly,
more than 65% of the sequence is in contigs
>30 kbp, more than 31% is in contigs >100
kbp, and the largest contig was 1.22 Mbp long.
Table 3 gives detailed summary statistics for
the structure of this assembly with a direct
comparison to the compartmentalized shotgun
assembly.
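
The summary figures quoted above follow from four numbers per size class. The short calculation below is a consistency check, using the whole-genome assembly values for scaffolds >100 kbp as they appear in Table 3; the function and key names are hypothetical.

```python
def scaffold_stats(span_bp, sequence_bp, n_scaffolds, n_contigs):
    """Derive gap count and average sizes from span, sequence, and counts."""
    n_gaps = n_contigs - n_scaffolds      # gaps lie between consecutive contigs of a scaffold
    return {
        "gaps": n_gaps,
        "avg_scaffold_bp": round(span_bp / n_scaffolds),
        "avg_contig_bp": round(sequence_bp / n_contigs),
        "avg_gap_bp": round((span_bp - sequence_bp) / n_gaps),
        "pct_sequence": round(100 * sequence_bp / span_bp),
    }


# Table 3, whole-genome assembly, scaffolds >100 kbp.
stats = scaffold_stats(
    span_bp=2_525_334_447,
    sequence_bp=2_297_678_935,
    n_scaffolds=1_637,
    n_contigs=95_494,
)
print(stats)
```

The derived values agree with the text: 93,857 gaps, an average scaffold of about 1.5 Mbp, an average contig of 24.06 kbp, an average gap of 2.43 kbp, and 91% sequence.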

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pur- 
sued a localized assembly approach that was 
intended to subdivide the genome into seg- 
ments, each of which could be shotgun as- 
sembled individually. We expected that this 
would help in resolution of large interchro- 
mosomal duplications and improve the statis- 
tics for calculating U-unitigs. The compart- 
mentalized assembly process involved clus- 
tering Celera reads and bactigs into large, 
multiple megabase regions of the genome, 
and then running the WGA assembler on the 
Celera data and shredded, faux reads ob- 
tained from the bactig data. 

The first phase of the CSA strategy was to
separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC
entry, and those that did not match any public
data. Such matches must be guaranteed to
properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not
have any matches, were nonetheless identified
as belonging in the region of the bactig's
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                                 Scaffold size
                                                 All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds
  (including intrascaffold gaps)       2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
No. of bp in contigs                   2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                              53,591          2,845          1,935          1,060            721
No. of contigs                               170,033        112,207        107,199         93,138         82,009
No. of gaps                                  116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                            72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                    54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                      15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)            2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                        1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                               100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds
  (including intrascaffold gaps)       2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
No. of bp in contigs                   2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                             118,968          2,507          1,637            818            554
No. of contigs                               221,036         99,189         95,494         84,641         76,285
No. of gaps                                  102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                            62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                    23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                      11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)            2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                        1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                               100             90             89             83             77

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there 
are excessive pileups indicative of un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions
are removed. Then all sets of mate pairs that
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
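
The bundling and greedy selection described above can be outlined as follows. This is a simplified sketch, not the combining assembler: mate pairs are reduced to (bactig, bactig, relative position) tuples, a bundle is simply the set of identical tuples, and the consistency test is reduced to rejecting a bundle that contradicts a heavier bundle already accepted for the same pair of bactigs, whereas the text describes a majority test over all links in the scaffold. All names are hypothetical.

```python
from collections import Counter


def greedy_scaffold(mate_links):
    """Greedily join bactigs using bundles of agreeing mate pairs.

    mate_links: iterable of (bactig_a, bactig_b, relative_position), one
    entry per mate pair. Returns (accepted_bundles, scaffolds).
    """
    bundles = Counter(mate_links)          # bundle weight = number of agreeing mates
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    accepted, chosen = [], {}
    # Heaviest bundles first; ties are broken by the tuple itself for determinism.
    for (a, b, rel), weight in sorted(bundles.items(), key=lambda kv: (-kv[1], kv[0])):
        if chosen.get((a, b), rel) != rel:
            continue                       # contradicts a heavier accepted bundle
        chosen[(a, b)] = rel
        parent[find(a)] = find(b)          # merge the two formative scaffolds
        accepted.append((a, b, rel, weight))

    scaffolds = {}
    for node in parent:
        scaffolds.setdefault(find(node), set()).add(node)
    return accepted, sorted(sorted(members) for members in scaffolds.values())


if __name__ == "__main__":
    links = ([("b1", "b2", "fwd")] * 3 + [("b1", "b2", "rev")]
             + [("b2", "b3", "fwd")] * 2 + [("b4", "b5", "fwd")])
    print(greedy_scaffold(links))
```

In the example the three agreeing mates between b1 and b2 outweigh the single conflicting one, so the result is two scaffolds, one holding b1, b2, and b3 and the other holding b4 and b5.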

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase 
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler resulted
in an average of 54.8 scaffolds consisting
of an average of 58.1 contigs of average
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler
was also applied to the Phase 3 BACs for
SNP identification, confirmation of assembly,
and localization of the Celera reads. The
Phase 0 data suggest that a combined whole-genome
shotgun data set and 1X light-shotgun
of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not 
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a 
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome com- 
ponents was to determine the order and over- 
lap tiling of these BAC and Celera-unique 
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pairs information, 
and BAC-end pairs (18) and sequence tagged
site (STS) markers (44) to provide long-
range guidance and chromosome separation. 
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



Chimeric or contaminating sequence (from 
another part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process 
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 

WGA assembly of the components result-
ed in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of se-
quence. The chaff, or set of reads not incor-
porated into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100 
kbp. The average scaffold size was 1.4 Mbp, 
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than 
49% of all gaps were <500 bp long, more 
than 62% of all gaps were <1 kbp, and all 
gaps are <100 kbp long. Similarly, more than 
73% of the sequence is in contigs >30 kbp,
more than 49% is in contigs >100 kbp, and
the largest contig was 1.99 Mbp long. Table 3 
provides summary statistics for the structure 
of this assembly with a direct comparison to 
the WGA assembly. 



2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
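
The selection criteria in this paragraph can be written down as two small predicates. This is a sketch under stated assumptions: the function names and the representation of a scaffold as a set of fragment identifiers are hypothetical, and only the thresholds (20 fragments, 20%, 200 bp, 2% mismatch) come from the text.

```python
def should_compare(frags_a, frags_b, min_shared=20, min_fraction=0.20):
    """Decide whether two scaffolds (given as sets of fragment IDs) are compared.

    They are compared if they share at least `min_shared` fragments, or at
    least `min_fraction` of the fragments of the smaller scaffold.
    """
    shared = len(frags_a & frags_b)
    smaller = min(len(frags_a), len(frags_b))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)


def keep_match(length_bp, mismatches, min_len=200, max_mismatch=0.02):
    """Keep a matching segment of at least 200 bp with at most 2% mismatch."""
    return length_bp >= min_len and mismatches <= max_mismatch * length_bp
```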

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA 
(3.95%) was not covered by the CSA, where- 
as 204.5 Mbp (8.26%) of the CSA was not 
covered by the WGA. This estimate did not 
require any consistency of the assemblies or 
any uniqueness of the matching segments. 
Thus, another analysis was conducted in 
which matches of less than 1 kbp between a 
pair of scaffolds were excluded unless they 
were confirmed by other matches having a
consistent order and orientation. This gives 
some measure of consistent coverage: 1.982 
Gbp (95.00%) of the WGA is covered by the 
CSA, and 2.169 Gbp (87.69%) of the CSA is 
covered by the WGA by this more stringent
measure.

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candi- 
dates was identified automatically, and then 
each candidate was inspected by hand. From 
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being fur- 
ther evaluated to determine which assembly 
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in 
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely >1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp 
(0.11%) in the WGA assembly were incon-
sistent with the CSA assembly.

The CSA assembly was a few percentage
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the in-
formation loss between the two is remarkably
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data be- 
tween the scaffolds. We next mapped the scaf- 
fold groups onto the chromosome using physi-
cal mapping data. This step depends on having
reliable high-resolution map information such 
that each scaffold will overlap multiple mark- 
ers. There are two genome-wide types of map 
information available: high-density STS maps
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the ge-
nome-wide STS maps, GeneMap99 (GM99)
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have bet- 
ter local order because they were built by com- 
parison of overlapping BAC clones. On the 
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a 
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.



In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaf- 
folds. Only 1% of the STS markers on the 10 
largest scaffolds (those >9 Mbp) were 
mapped on a different chromosome on 
GM99. Two percent of the STS markers dis- 
agreed in position by more than five frame- 
work bins. However, for the fingerprint 
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed 
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low discrep- 
ancy rate to GM99, and thus we concluded 
that the fingerprint map global order in these 
cases was not reliable. Smaller scaffolds had 
a higher discordance rate with GM99 (4.21% 
of STSs were discordant by more than five 
framework bins), but a lower discordance rate 
with the fingerprint maps (11% of BACs 
disagreed with fingerprint maps by more than 
five BACs). This observation agrees with the
clone coverage analysis (46) that Celera scaf-
fold construction was better supported by 
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could
be ordered relative to the anchored scaffolds
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of
the assembly could be ordered by these addi-
tional methods, and thus 84.0% of the ge-
nome was ordered unambiguously.
Next, all scaffolds that could be placed,
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bound- 
ed" between them. For example, small scaf- 
folds having STS hits from the same Gene- 
Map bin or hitting the same BAC cannot be
ordered relative to each other, but can be 
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above ap- 
proaches, ~98% of the genome was an-
chored, ordered, or bounded. 

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold 
lengths with the sum of the number of 
mapped scaffolds, we arrived at an estimate 
of interscaffold gap of 1483 bp. This gap was 
used to separate all the scaffolds on each 
chromosome and to assign an offset in the 
chromosome.
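
The spacing rule in this paragraph amounts to a single division followed by a running sum. The sketch below uses hypothetical function names and toy numbers; it does not reproduce the inputs behind the 1483-bp estimate.

```python
def interscaffold_gap(unmapped_lengths, n_mapped_scaffolds):
    """Spread the unmapped sequence evenly over the mapped scaffolds."""
    return round(sum(unmapped_lengths) / n_mapped_scaffolds)


def assign_offsets(scaffold_lengths, gap):
    """Return the chromosome offset of each scaffold, in order.

    Consecutive scaffolds are separated by the same fixed gap.
    """
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets
```

With a toy gap of 500 bp, scaffolds of 1000, 2000, and 500 bp would start at offsets 0, 1500, and 4000.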

During the scaffold-mapping effort, we en-
countered many problems that resulted in addi- 
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by
rearranging scaffolds to fit the transcript data
and made validation of both the assembly and
gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the eu-
chromatin sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro-
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this se-
quence served as input to the assembler, the 
finished sequence was shredded into a shot- 
gun data set so that the assembler had the 
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must 
be able to resolve repetitive elements at the 
scale of components (generally multimega-
base in size), and so this comparison reveals
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements.

A more global way of assessing complete-
ness is to measure the content of an independent
set of sequence data in the assembly. We com-
pared 48,938 STS markers from GeneMap99
(51) to the scaffolds. Because these markers 
were not used in the assembly processes, they 
provided a truly independent measure of com- 
pleteness. ePCR (53) and BLAST (54) were
used to locate STSs on the assembled genome.
We found 44,524 (91%) of the STSs in the
mapped genome. An additional 2648 markers
(5.4%) were found by searching the unas-
sembled data or "chaff." We identified 1283
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371 
markers (88%) were located in the mapped 
CSA scaffolds, with 2055 markers (5.6%) 
found in the remainder. This gave a 94% cov- 
erage of the genome through another genome- 
wide survey. 
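
The marker-based completeness figures above follow from simple ratios. The helper below is a hypothetical sketch (the function name and signature are assumptions); fed with the marker counts quoted in the text, it agrees with the quoted percentages to within about a tenth of a percentage point.

```python
def completeness(found_in_assembly, found_in_chaff, total_markers, excluded=0):
    """Percent of markers found in the assembly, in the chaff, and combined.

    `excluded` removes markers suspected of not being of human origin from
    the denominator.
    """
    denominator = total_markers - excluded
    assembled = 100.0 * found_in_assembly / denominator
    chaff = 100.0 * found_in_chaff / denominator
    return assembled, chaff, assembled + chaff


# GeneMap99 markers, all 48,938 counted.
gm99_all = completeness(44524, 2648, 48938)
# GeneMap99 markers, excluding the 1283 found in neither data set.
gm99_adjusted = completeness(44524, 2648, 48938, excluded=1283)
# TNG radiation hybrid markers.
tng = completeness(32371, 2055, 36678)
```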

Correctness. Correctness is defined as the
structural and sequence accuracy of the as-
sembly. Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the as-
sembly against other finished sequence for
determining sequencing accuracy at the nu-
cleotide level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaf-
folds were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had or-
der conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Un-
mapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

Mapped scaffold category   Number   Length (bp)     % Total length
Anchored                   1,526    1,860,676,676   70
  Oriented                 1,246    1,852,088,645   70
  Unoriented               280      8,588,031       0.3
Ordered                    2,001    369,235,857     14
  Oriented                 839      329,633,166     12
  Unoriented               1,162    39,602,691      2
Bounded                    38,241   368,753,463     14
  Oriented                 7,453    274,536,424     10
  Unoriented               30,788   94,217,039      4
Unmapped                   11,823   55,313,737      2
  Known chromosome         281      2,505,844       0.1
  Unknown chromosome       11,542   52,807,893      2

The structural consistency of the assembly 
can be measured by mate-pair analysis. In a 
correct assembly, every mated pair of se- 
quencing reads should be located on the con- 
sensus sequence with the correct separation 
and orientation between the pairs. A pair is
termed "valid" when the reads are in the
correct orientation and the distance between
them is within the mean ± 3 standard devi- 
ations of the distribution of insert sizes of the 
library from which the pair was sampled. A 
pair is termed "misoriented" when the reads 
are not correctly oriented, and is termed "mis- 
separated" when the distance between the 
reads is not in the correct range but the reads 
are correctly oriented. The mean ± the stan- 
dard deviation of each library used by the 
assembler was determined as described 
above. To validate these, we examined all 
reads mapped to the finished sequence of 
chromosome 21 (48) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chi- 
merism (two different segments of the ge- 
nome cloned into the same plasmid), and how 
tight the distribution of insert sizes was for
those that were correct (Table 5). The stan-
dard deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries con-
tained less than 2% invalid mate pairs, where-
as the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair infor-
mation was not perfect, its accuracy was such
that measuring valid, misoriented, and mis-
separated pairs with respect to a given assem-
bly was deemed to be a reliable instrument
for validation purposes, especially when sev-
eral mate pairs confirm or deny an ordering.
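
The three mate-pair categories defined above map directly onto a small classifier. This is a minimal sketch: the function name and argument layout are assumptions, and the example values use the library 1 statistics for chromosome 21 from Table 5 (mean 2,081 bp, SD 106 bp).

```python
def classify_mate_pair(correctly_oriented, separation, mean, sd, n_sd=3.0):
    """Classify a mate pair as 'valid', 'misoriented', or 'misseparated'.

    A pair is valid when the reads are correctly oriented and their
    separation lies within mean +/- n_sd standard deviations of the
    insert-size distribution of the library it was sampled from.
    """
    if not correctly_oriented:
        return "misoriented"
    if abs(separation - mean) <= n_sd * sd:
        return "valid"
    return "misseparated"


# A 2,300-bp separation is within 2,081 +/- 318 bp; 2,500 bp is not.
in_range = classify_mate_pair(True, 2300, mean=2081, sd=106)
out_of_range = classify_mate_pair(True, 2500, mean=2081, sd=106)
```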

The clone coverage of the genome was 
39X, meaning that any given base pair was,
on average, contained in 39 clones or, equiv- 
alently, spanned by 39 mate-paired reads. 
Areas of low clone coverage or areas with a 
high proportion of invalid mate pairs would 
indicate potential assembly problems. We 
computed the coverage of each base in the 
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported 
by this measure alone. 
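
Per-base clone coverage of this kind can be computed with a difference array, as in the sketch below. The function names and the half-open `(start, end)` representation of the span between two valid mates are assumptions made for the example.

```python
def clone_coverage(scaffold_length, valid_pair_spans):
    """Count, for each base, how many valid mate pairs span it.

    Each span is a half-open interval (start, end) from the outer end of
    one mate to the outer end of the other.
    """
    diff = [0] * (scaffold_length + 1)
    for start, end in valid_pair_spans:
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    depth = 0
    for i in range(scaffold_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases with clone coverage below the threshold."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)
```

The difference array keeps the cost linear in the scaffold length plus the number of mate pairs, rather than their product.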

We examined the locations and number of 
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to 
the PFP assembly. To avoid mapping errors 
due to high-fidelity repeats, the only pairs 
mapped were those for which both reads 
matched at only one location with less than
6% differences. A threshold was set such that
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint, 
where the construction of the two assemblies 
differed. The graphic comparison of the CSA 
chromosome 21 assembly with the published 
sequence (Fig. 6A) serves as a validation of 
this methodology. Blue tick marks in the 
panels indicate breakpoints. There were a 
similar (small) number of breakpoints on 
both chromosome sequences. The exception 
was 12 sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
in 212 single-contig scaffolds) that were 
mapped to the wrong positions because they 
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two 
assemblies. There was a higher percentage of 
misoriented and misseparated mate pairs in 
the large-insert libraries (50 kbp and BAC 
ends) than in the small-insert libraries in both 
assemblies (Table 6). The large-insert librar- 
ies are more likely to identify discrepancies 
simply because they span a larger segment of 
the genome. The graphic comparison be- 
tween the two assemblies for chromosome 8 
(Fig. 6, B and C) shows that there are many 



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orienta-
tion or placement, they were considered invalid (number of invalid mate
pairs).

                  Chromosome 21                                                          Genome
Library  Library  Mean insert  SD      SD/mean  No. of mate   No. of invalid  %          Mean insert  SD      SD/mean
type     no.      size (bp)    (bp)    (%)      pairs tested  mate pairs      invalid    size (bp)    (bp)    (%)
2 kbp    1        2,081        106     5.1      3,642         38              1.0        2,082        90      4.3
         2        1,913        152     7.9      28,029        413             1.5        1,923        118     6.1
         3        2,166        175     8.1      4,405         57              1.3        2,162        158     7.3
10 kbp   4        11,385       851     7.5      4,319         80              1.9        11,370       696     6.1
         5        14,523       1,875   12.9     7,355         156             2.1        14,142       1,402   9.9
         6        9,635        1,035   10.7     5,573         109             2.0        9,606        934     9.7
         7        10,223       928     9.1      34,079        399             1.2        10,190       777     7.6
50 kbp   8        64,888       2,747   4.2      16            1               6.3        65,500       5,504   8.4
         9        53,410       5,834   10.9     914           170             18.6       53,311       5,546   10.4
         10       52,034       7,312   14.1     5,871         569             9.7        51,498       6,588   12.8
         11       52,282       7,454   14.3     2,629         213             8.1        52,282       7,454   14.3
         12       46,616       7,378   15.8     2,153         215             10.0       45,418       9,068   20.0
         13       55,788       10,099  18.1     2,244         249             11.1       53,062       10,893  20.5
         14       39,894       5,019   12.6     199           7               3.5        36,838       9,988   27.1
         15       48,931       9,813   20.1     144           10              6.9        47,845       4,774   10.0
         16       48,130       4,232   8.8      195           14              7.2        47,924       4,581   9.6
BES      17       106,027      27,778  26.2     330           16              4.8        152,000      26,600  17.5
         18       160,575      54,973  34.2     155           8               5.2        161,750      27,000  16.7
         19       164,155      19,453  11.9     642           44              6.9        176,500      19,500  11.05
Sum                                             102,894       2,768           2.7 (mean = 2.7)

gene boundaries. During this process, multiple
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins," each of which was be-
lieved to contain a single gene. One weakness of
this initial implementation of the algorithm was
in predicting gene boundaries in regions of tan-
demly duplicated genes. Gene clusters frequent-
ly resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models.
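
Collapsing multiple overlapping hits into the union of the regions they cover is a standard interval merge. The sketch below is illustrative only; the function name and the `(start, end)` coordinate convention are assumptions, not details of the actual pipeline.

```python
def merge_intervals(hits):
    """Return the union of possibly overlapping (start, end) hit regions.

    Hits are sorted by start; a hit that begins inside or at the end of the
    current merged region extends it, otherwise it starts a new region.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Three overlapping ESTs collapse into one supported region; a fourth stays separate.
supported = merge_intervals([(100, 250), (200, 400), (900, 1000), (380, 500)])
```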

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge-
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curat- 
ed human gene set RefSeq from the Nation- 
al Center for Biotechnology Information 
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation. 
Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 
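
The promotion rule for RefSeq transcripts reduces to two threshold checks. In this sketch the function name and argument names are hypothetical; only the 50% length and 92% identity thresholds come from the text.

```python
def promote_refseq(aligned_bases, transcript_length, identity,
                   min_coverage=0.50, min_identity=0.92):
    """Return True if a RefSeq alignment qualifies as an Otto annotation.

    The transcript must match the assembly over at least 50% of its length
    at greater than 92% identity.
    """
    covered_enough = aligned_bases >= min_coverage * transcript_length
    return covered_enough and identity > min_identity
```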

Regions that have a substantial amount of 
sequence similarity, but do not match known 
genes, were analyzed by that part of the Otto 
system that uses the sequence similarity in- 
formation to predict a transcript. Here, Otto 
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Fig. 6. Comparison of the CSA and the PFP assembly. 
(A) All of chromosome 21, (B) all of chromosome 8, 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations 
(red, misoriented; yellow, incorrect distance between 
the mates) for each assembly grouped by library sire. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web 
fig. 3 on Science Online at www.sciencemag.org/cgi/ 
content/f ull/29 1 /5 507/ 1 304/DC 1 . 
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evaluates evidence generated by the compu- 
tational pipeline, corresponding to conserva- . 
tion between mouse and human genomic 
DNA, similarity to human transcripts (ESTs 



and cDNAs), similarity to rodent transcripts 
(ESTs and cDNAs), and similarity of the 
translation of human genomic DNA to known 
proteins to predict potential genes in the hu- 



man genome. The sequence from the region 
of genomic DNA contained in a gene bin was 
extracted, and the subsequences supported by 
any homology evidence were marked (plus 100 
Fig. 7. Schematic view of the distribution of breakpoints and large gaps on all chromosomes. For each chromosome, the upper pair of lines represent the PFP assembly, and the lower pair of lines represent Celera's assembly. Blue tick marks represent breakpoints, whereas red tick marks represent a gap of larger than 10,000 bp. The number of breakpoints per chromosome is indicated in black, and the chromosome numbers in red.
bases flanking these regions). The other bases
in the region, those not covered by any homol- 
ogy evidence, were replaced by N's. This se- 
quence segment, with high confidence regions 
represented by the consensus genomic se- 
quence and the remainder represented by N's, 
was then evaluated by Genscan to see if a 
consistent gene model could be generated. This 
procedure simplified the gene-prediction task 
by first establishing the boundary for the gene 
(not a strength of most gene-finding algo- 
rithms), and by eliminating regions with no 
supporting evidence. If Genscan returned a 
plausible gene model, it was further evaluated 
before being promoted to an "Otto" annotation. 
The final Genscan predictions were often quite 
different from the prediction that Genscan re- 
turned on the same region of native genomic 
sequence. A weakness of using Genscan to 
refine the gene model is the loss of valid, small 
exons from the final annotation. 
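
The masking step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Otto implementation: the function name `mask_unsupported`, the half-open `(start, end)` coordinate convention, and the toy inputs are all hypothetical; only the 100-base flank and the replacement of unsupported bases by N come from the text.

```python
def mask_unsupported(sequence, evidence_spans, flank=100):
    """Replace every base not covered by evidence (plus flanks) with 'N'.

    evidence_spans: iterable of half-open (start, end) intervals, 0-based.
    The coordinate convention is an assumption made for this sketch.
    """
    keep = [False] * len(sequence)
    for start, end in evidence_spans:
        lo = max(0, start - flank)              # extend left by the flank
        hi = min(len(sequence), end + flank)    # extend right by the flank
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if k else "N" for base, k in zip(sequence, keep))


if __name__ == "__main__":
    toy = "ACGT" * 10
    # A tiny flank is used so the effect is visible on a 40-base toy sequence.
    print(mask_unsupported(toy, [(10, 12)], flank=2))
```

The masked sequence would then be handed to a gene finder; that call is outside the scope of this sketch.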

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the 
external edge was allowed greater latitude to 
allow for 5' and 3' untranslated regions 
(UTRs). To be retained, a prediction for a 
multi-exon gene must have evidence such that 
the total number of "hits," as defined above,
divided by the number of exons in the prediction,
must be >0.66 or must correspond to a
RefSeq sequence. A single-exon gene must be 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and 

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).



Method                  Sensitivity   Specificity
Otto (RefSeq only)*     0.939         0.973
Otto (homology)†        0.604         0.884
Genscan                 0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an
evidence-based Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.



those that passed were promoted to Otto 
predictions. Homology-based Otto predic- 
tions do not contain 3' and 5' untranslated 
sequence. Although three de novo gene-finding 
programs [GRAIL, Genscan, and FgenesH 
(63)] were run as part of the computational 
analysis, the results of these programs were not 
directly used in making the Otto predictions. 
Otto predicted 11,226 additional genes by 
means of sequence similarity. 
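
The retention rule for multi-exon predictions can be illustrated as follows. This is a hedged sketch, not the Otto code: the function names, the plus-strand ordering of exons, the reading of "covered to within ±10 bases" as "evidence edges lie within 10 bases of the exon edges", and the treatment of the external edge of first and last exons (unconstrained, provided the evidence overlaps the exon) are all assumptions made for illustration. The 10-base tolerance, the 0.66 threshold, and the RefSeq exemption come from the text.

```python
def exon_supported(exon, evidence_spans, position="internal", tol=10):
    """Return True if some evidence span supports the exon.

    exon and evidence spans are (start, end) pairs with start < end.
    position is "first", "last", or "internal" (plus-strand order assumed).
    """
    start, end = exon
    for ev_start, ev_end in evidence_spans:
        overlaps = ev_start < end and ev_end > start
        left_ok = abs(ev_start - start) <= tol
        right_ok = abs(ev_end - end) <= tol
        if position == "first":      # only the internal (right) edge is strict
            ok = overlaps and right_ok
        elif position == "last":     # only the internal (left) edge is strict
            ok = overlaps and left_ok
        else:
            ok = left_ok and right_ok
        if ok:
            return True
    return False


def keep_multi_exon(exons, evidence_spans, matches_refseq=False,
                    min_ratio=0.66, tol=10):
    """Apply the hits/exons > 0.66 rule, or accept a RefSeq correspondence."""
    if matches_refseq:
        return True
    last = len(exons) - 1
    hits = 0
    for i, exon in enumerate(exons):
        position = "first" if i == 0 else "last" if i == last else "internal"
        if exon_supported(exon, evidence_spans, position, tol):
            hits += 1
    return hits / len(exons) > min_ratio
```

With three exons, two supported exons give a ratio of about 0.667 and the prediction is kept; one supported exon gives 0.333 and it is dropped. The single-exon rule (three hits, full open reading frame, start and stop codons) is not modeled here.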

3.2 Otto validation 

To validate the Otto homology-based process 
and the method that Otto uses to define the 
structures of known genes, we compared tran- 
scripts predicted by Otto with their correspond- 
ing (and presumably correct) transcript from a 
set of 4512 RefSeq transcripts for which there 
was a unique SIM4 alignment (Table 7). In 
order to evaluate the relative performance of 
Otto and Genscan, we made three comparisons. 
The first involved a determination of the accu- 
racy of gene models predicted by Otto with 
only homology data other than the correspond- 
ing RefSeq sequence (Otto homology in Table 
7). We measured the sensitivity (correctly pre- 
dicted bases divided by the total length of the 
cDNA) and specificity (correctly predicted 
bases divided by the sum of the correctly and 
incorrectly predicted bases). Second, we exam- 
ined the sensitivity and specificity of the Otto 
predictions that were made solely with the RefSeq
sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq). 
And third, we determined the accuracy of the 
Genscan predictions corresponding to these 
RefSeq sequences. As expected, the alignment 
method (Otto-RefSeq) was the most accurate, 
and Otto-homology performed better than Gen- 
scan by both criteria. Thus, 6.1% of true RefSeq 
nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not con- 
tained in the original RefSeq transcripts. The 
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 
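
As a worked check of the two measures, the sketch below computes sensitivity and specificity from the base counts defined in Table 7 and shows how the 6.1% and 2.7% figures follow from the Otto (RefSeq only) row. The function name and the toy counts are illustrative assumptions; the definitions and the values 0.939 and 0.973 are taken from the text.

```python
def sensitivity_specificity(n_aligned, refseq_len, prediction_len):
    """n_aligned: uniquely aligned RefSeq bases (N in Table 7)."""
    sensitivity = n_aligned / refseq_len       # share of the RefSeq recovered
    specificity = n_aligned / prediction_len   # share of the prediction that is correct
    return sensitivity, specificity


# Toy numbers, purely illustrative.
sens, spec = sensitivity_specificity(n_aligned=900, refseq_len=1000,
                                     prediction_len=1200)

# Complements of the Otto (RefSeq only) values reported in Table 7.
missed_refseq = round(1 - 0.939, 3)   # RefSeq bases absent from the annotation
extra_bases = round(1 - 0.973, 3)     # annotated bases absent from RefSeq
```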

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons 
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be combined
into a single transcript. We also examined
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number

Recognizing that the Otto system is quite
conservative, we used a different gene-prediction
strategy in regions where the homology
evidence was less strong. Here the
results of de novo gene predictions were
used. For these genes, we insisted that a
predicted transcript have at least two of the
following types of evidence to be included
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome
fragment matches. This final class of predicted
genes is a subset of the predictions
made by the three gene-finding programs
that were used in the computational pipeline.
For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs resulted
in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping
with one another). Of these,
57,935 did not overlap known genes or
predictions made by Otto. Only 21,350 of
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement
for other supporting evidence is
made more stringent, this number drops rapidly,
so that demanding two types of evidence
reduces the total gene number to 26,383 and
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode 
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 
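
The gene-count arithmetic in this paragraph can be reproduced directly. The sketch below adds the 17,764 Otto annotations to the de novo transcript counts at each evidence level; the counts for three and four lines of evidence are read here from Table 8 as 4,947 and 1,904, which is an assumption about the intended values in that table. Variable names are illustrative.

```python
OTTO_GENES = 17_764

# De novo predictions not overlapping Otto, keyed by the minimum number
# of supporting evidence types (values as read from Table 8 and the text).
DE_NOVO_BY_MIN_EVIDENCE = {1: 21_350, 2: 8_619, 3: 4_947, 4: 1_904}

totals = {k: OTTO_GENES + n for k, n in DE_NOVO_BY_MIN_EVIDENCE.items()}

for k in sorted(totals):
    print(f">= {k} evidence type(s): {totals[k]:,} genes")
```

The first two totals match the 39,114 and 26,383 quoted in the text, and the third (22,711) is consistent with the rounded figure of about 23,000.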

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the following
evidence types (homology to mouse genomic sequence
fragments, rodent ESTs, or cDNAs, or similarity to
a known protein)
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the 
number of genes and presents the degree of 
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set 
includes the 6538 genes predicted by Otto on 
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homol- 
ogy evidence, and 8619 from the subset of 
transcripts from de novo gene-prediction pro- 
grams that have two types of supporting ev- 
idence. The 26,383 genes are illustrated along 
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are
subject to all the limitations of an automated
process. Considerable refinement is still necessary
to improve the accuracy of these transcript
predictions. All the predictions and
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that support
the Otto and other predicted transcripts.
For example, one can see that a typical Otto 
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of
the noncoding attributes of the assembled
genome sequence and their correlations with
the predicted gene set. These include an analysis
of G+C content and gene density in the
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 



4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the 
most visible, element of the structure of
the genome is the banding pattern produced
by Giemsa stain. Chromosomal banding
studies have revealed that about 17% to
20% of the human chromosome complement
consists of C-bands, or constitutive
heterochromatin (64). Much of this heterochromatin
is highly polymorphic and consists
of different families of alpha satellite
DNAs with various higher order repeat
structures (65). Many chromosomes have
complex inter- and intrachromosomal duplications
present in pericentromeric regions
(66). About 5% of the sequence reads
were identified as alpha satellite sequences; 
these were not included in the assembly. 
Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate 
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions). * 







                               Total     Types of evidence                       No. of lines of evidence*
                                         Mouse     Rodent    Protein   Human     ≥1        ≥2        ≥3        ≥4
Otto
  Number of transcripts        17,969    17,065    14,881    15,477    16,374    17,968†   17,501    15,877    12,451
  Number of exons              141,218   111,174   89,569    108,431   118,869   140,710   127,955   99,574    59,804
De novo
  Number of transcripts        58,032    14,463    5,094     8,043     9,220     21,350    8,619     4,947     1,904
  Number of exons              319,935   48,594    19,344    26,264    40,104    79,148    31,130    17,508    6,520
No. of exons per transcript
  Otto                         7.84      5.77      6.01      6.99      7.24      7.81      7.19      6.00      4.28
  De novo                      5.53      3.17      3.80      3.27      4.36      3.7       3.56      3.42      3.16

*[...] considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript.
†This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
Examination of pericentromeric regions is
ongoing. 

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-, 
R-, and T-bands (67). These cytogenetic bands 
have been presumed to differ in their nucleotide 
composition and gene density, although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are 
the most G+C- and gene-rich, and G-bands are 
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy)
isochores fall into three G+C-rich classes
representing 24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in
the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of
G+C content >48% (H3 isochores) averaged
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was 
1078.6 kbp. The correlation between G+C 
content and gene density was also examined in 
50-kbp windows along the assembled sequence 
(Table 9 and Figs. 10 and 11). We found that 
the density of genes was greater in regions of 
high G+C than in regions of low G+C content, 
as expected. However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 
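
A windowed G+C classification along the lines described above might look like the following. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, undetermined bases (N) are excluded from the denominator by assumption, and windows falling exactly on the 43% or 48% boundaries are assigned to the H1+H2 class by assumption, since the text does not say how ties were handled.

```python
def gc_fraction(window):
    """G+C fraction over determined bases only; None if the window is all N."""
    window = window.upper()
    gc = window.count("G") + window.count("C")
    at = window.count("A") + window.count("T")
    total = gc + at
    return gc / total if total else None


def classify(gc):
    """Map a G+C fraction to the isochore classes used in the text."""
    if gc is None:
        return None
    if gc < 0.43:
        return "L"
    if gc <= 0.48:
        return "H1+H2"
    return "H3"


def classify_windows(sequence, size=50_000):
    """Classify contiguous, non-overlapping windows of the given size."""
    return [classify(gc_fraction(sequence[i:i + size]))
            for i in range(0, len(sequence), size)]
```

Runs of adjacent windows with the same label could then be merged to measure region lengths such as the 273.9-kbp average reported for H3; that merging step is left out of this sketch.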

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density (X, 4,
18, 13, and Y) also have the fewest H3 bands.
Chromosome 15, which also has few H3
bands, did not have a particularly low gene
density in our analysis. In addition, chromosome
8, which we found to have a low gene
density, does not appear to be unusual in its
H3 banding.

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It appears
that the human genome does indeed contain
deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22
have only about 12% of their collective 171
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not nec- 
essarily imply that they are devoid of biological 
function. 
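
The desert definition used here (a region of more than 500 kbp without a gene) lends itself to a short interval scan. The sketch below is illustrative only: the function name, the half-open coordinates, and the decision to count gene-free stretches at the chromosome ends as deserts are assumptions, not details given in the text.

```python
def find_deserts(genes, chrom_len, min_len=500_000):
    """Return gene-free intervals longer than min_len.

    genes: iterable of (start, end) half-open intervals on one chromosome.
    """
    deserts = []
    pos = 0                                   # end of the last gene seen so far
    for start, end in sorted(genes):
        if start - pos > min_len:
            deserts.append((pos, start))
        pos = max(pos, end)                   # tolerate overlapping genes
    if chrom_len - pos > min_len:
        deserts.append((pos, chrom_len))
    return deserts


toy_genes = [(100_000, 200_000), (900_000, 950_000), (1_000_000, 1_100_000)]
toy_deserts = find_deserts(toy_genes, chrom_len=2_000_000)
desert_fraction = sum(e - s for s, e in toy_deserts) / 2_000_000
```

Applied genome-wide, the same ratio corresponds to the figure in the text: 605 Mbp of desert in a 2.91-Gbp genome is about 20.8%.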

Table 9. Characteristics of G+C in isochores.

4.2 Linkage map

Linkage maps provide the basis for genetic
analysis and are widely used in the study of the
inheritance of traits and in the positional cloning
of genes. The distance metric, centimorgans
(cM), is based on the recombination rate
between homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this 
degree of map expansion is not uniform across 
the genome (72). One of the opportunities en- 
abled by a nearly complete genome sequence is 
to produce the ultimate physical map, and to 
fully analyze its correspondence with two other 
maps that have been widely used in genome 
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop 
between the mapping and sequencing phases of 
the genome project. 

We mapped the location of the markers 
that constitute the Genethon linkage map to 
the genome. The rate of recombination, ex- 
pressed as cM per Mbp, was calculated for 
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been previ- 
ously documented (73). From this mapping 
result, there is a difference of 4.99 between 
lowest rates and highest rates and the largest 
difference of 4.4 between males and females 
(4.99 to 0.47 on chromosome 16). This indi- 
cates that the variability in recombination 
rates among regions of the genome exceeds 
the differences in recombination rates be- 
tween males and females. The human ge- 
nome has recombination hotspots, where re- 
combination rates vary fivefold or more over 
a space of 1 kbp, so the picture one gets of the 
magnitude of variability in recombination 
rate will depend on the size of the window 



Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and [...] have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20.
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphisme Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor of the variation in
recombination rates between any pair of
markers would be extremely useful in designing
markers to narrow a region of linkage,
such as in positional cloning projects.
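
To make the cM-per-Mbp calculation concrete, the sketch below bins markers into 3-Mbp windows and divides the genetic distance spanned within each window by the physical distance spanned. The text does not spell out how the per-window rate was computed, so the use of the first and last marker in each window is an assumption, as are the function name and the toy marker positions.

```python
def window_rates(markers, window_bp=3_000_000):
    """Recombination rate (cM/Mbp) per window.

    markers: list of (position_bp, map_position_cM), sorted by position.
    Windows with fewer than two markers are skipped.
    """
    by_window = {}
    for pos, cm in markers:
        by_window.setdefault(pos // window_bp, []).append((pos, cm))

    rates = {}
    for index, points in by_window.items():
        if len(points) < 2:
            continue
        (p0, c0), (p1, c1) = points[0], points[-1]
        if p1 > p0:
            rates[index] = (c1 - c0) / ((p1 - p0) / 1e6)
    return rates


toy_markers = [(0, 0.0), (1_000_000, 1.0), (2_000_000, 3.0),
               (3_500_000, 4.0), (5_500_000, 5.0)]
toy_rates = window_rates(toy_markers)
```

Running the same function on male, female, and sex-averaged map positions would give the three column groups of a table such as Table 12.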

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79).

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer
(75) used a computational method to identify
CpG islands and defined them as regions
of DNA of >200 bp that have a G+C
content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide
>0.6.
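
The three criteria can be checked for a candidate region as sketched below. The observed/expected ratio is computed in the usual Gardiner-Garden and Frommer form, (number of CG dinucleotides x length) / (number of C x number of G); the function names and the test sequences are illustrative assumptions, and no sliding-window scan or merging of windows is attempted here.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")          # CG cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the >200 bp, >50% G+C, observed/expected >0.6 criteria."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction > min_gc and obs_exp > min_ratio
```

Raising `min_ratio` to 0.8 would give the stricter threshold that the text later calls method 2.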

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG islands
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly 
available annotation of chromosome 22, as 
well as using the entire human genome in 
our assembly and the computationally an- 
notated genes. A variation of the CpG is- 
land computation was compared with 
Larsen et al. (75). The main differences are
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 

To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
otide likelihood ratio. Besides using the orig- 
inal threshold of 0.6 (method 1), we used a 
higher threshold of CG dinucleotide likeli- 
hood ratio of 0.8 (method 2), which results in 
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are summarized
in Table 13. CpG islands computed
with method 1 predicted only 2.6% of the
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a 
Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the
genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of
genes associated with each G+C bin is represented by the yellow bars. The graph shows that about
5% of the genome has a G+C content of between 50 and 55%, but that this portion contains
nearly 15% of the genes.



CpG island. This is comparable to ratios reported
by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island are
smaller than the corresponding expected distances,
confirming an association between CpG islands
and the first exon.

We also looked at the distribution of CpG 
island nucleotides among various sequence 
classes such as intergenic regions, introns,
exons, and first exons. We computed the 
likelihood score for each sequence class as 
the ratio of the observed fraction of CpG 
island nucleotides in that sequence class 
and the expected fraction of CpG island 
nucleotides in that sequence class. The results
of applying method 1 to the CSA were
scores of 0.89 for intergenic region, 1.2 for 
intron, 5.86 for exon, and 13.2 for first 
exon. The same trend was also found for 
chromosome 22 and after the application of 
a higher threshold (method 2) on both data 
sets. In sum, genome-wide analysis has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
first coding exons. 
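
The likelihood score can be written as a ratio of two fractions. In the sketch below the expected fraction for a sequence class is taken to be that class's share of all nucleotides, which is an assumption about what "expected" means in the text; the function name and the toy counts are likewise illustrative.

```python
def likelihood_scores(island_nt, class_nt):
    """Observed/expected share of CpG-island nucleotides per sequence class.

    island_nt: CpG-island nucleotides falling in each class.
    class_nt:  total nucleotides in each class.
    """
    total_island = sum(island_nt.values())
    total_nt = sum(class_nt.values())
    scores = {}
    for cls, n in class_nt.items():
        observed = island_nt.get(cls, 0) / total_island
        expected = n / total_nt      # assumed: proportional to class size
        scores[cls] = observed / expected
    return scores


toy_scores = likelihood_scores(
    island_nt={"intergenic": 50, "exon": 50},
    class_nt={"intergenic": 900, "exon": 100},
)
```

A score above 1 means the class holds more CpG-island sequence than its size alone would predict, which is the pattern the text reports for exons and especially first exons.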

4.4 Genome-wide repetitive elements

The proportion of the genome covered by
various classes of repetitive DNA is presented
in Table 14. We observed about 35% of
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 
Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp windows. The number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human
genome

Retrotransposition of processed mRNA
transcripts into the genome results in functional
genes, called intronless paralogs, or
inactivated genes (pseudogenes). A paralog
refers to a gene that appears in more than
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of
genes encoding functionally similar or
identical proteins has been previously described
(84, 85). Cataloging these evolutionary
events on the genomic landscape is
of value in understanding the functional
consequences of such gene-duplication
events in cellular biology. Identification of
conserved intronless paralogs in the mouse
or other mammalian genomes should provide
the basis for capturing the evolutionary
chronology of these transposition
events and provide insights into gene loss
and accretion in the mammalian radiation.

A set of proteins corresponding to all 901
Otto-predicted, single-exon genes were subjected
to BLAST analysis against the proteins
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% sequence identity over 90% of the 
length, we identified 298 instances of single- 
to multi-exon correspondence. Of these 298 
sequences, 97 were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified 
and were verified by manual inspection.
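
The homology criteria (70% sequence identity over 90% of the length) amount to a two-threshold filter on alignment results. The sketch below is illustrative only: the function name, the choice of the single-exon protein as the reference length, and the inclusive comparisons are assumptions, and no particular BLAST output format is implied.

```python
def passes_homology(identity, aligned_len, reference_len,
                    min_identity=0.70, min_coverage=0.90):
    """True if an alignment meets both the identity and coverage thresholds.

    identity:      fraction of identical residues in the aligned region (0 to 1).
    aligned_len:   number of reference residues covered by the alignment.
    reference_len: full length of the reference protein (assumed here to be
                   the single-exon gene product).
    """
    coverage = aligned_len / reference_len
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical hits: (identity, aligned length, reference length)
toy_hits = [(0.85, 290, 300), (0.95, 200, 300), (0.60, 300, 300)]
kept = [hit for hit in toy_hits if passes_homology(*hit)]
```

Only the first toy hit survives: the second covers too little of the reference and the third is not similar enough.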

We believe that these 97 cases may represent
intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (84, 86). We do not find a bias
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chromosomes.
Interesting examples include the
retrotransposition of a five exon-containing 
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed intronless
paralogs contains a clear overrepresentation
of genes involved in translational
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could account
for differences in tissue-specific gene
expression. Defining which, if any, of these 
processed genes are functionally expressed
and translated will require further elucidation 
and experimental validation. 



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not expressed.
We developed a method for the preliminary
analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces

Table 11. Genome overview.

Size of the genome (including gaps)                                     2.91 Gbp
Size of the genome (excluding gaps)                                     2.66 Gbp
Longest contig                                                          1.99 Mbp
Longest scaffold                                                        14.4 Mbp
Percent of A+T in the genome                                            54
Percent of G+C in the genome                                            38
Percent of undetermined bases in the genome                             9
Most GC-rich 50 kb                                                      Chr. 2 (66%)
Least GC-rich 50 kb                                                     Chr. X (25%)
Percent of genome classified as repeats                                 35
Number of annotated genes                                               26,383
Percent of annotated genes with unknown function                        42
Number of genes (hypothetical and annotated)                            39,114
Percent of hypothetical and annotated genes with unknown function       59
Gene with the most exons                                                Titin (234 exons)
Average gene size                                                       27 kbp
Most gene-rich chromosome                                               Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                             Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)            605 Mbp
Percent of base pairs spanned by genes                                  25.5 to 37.8*
Percent of base pairs spanned by exons                                  1.1 to 1.4*
Percent of base pairs spanned by introns                                24.4 to 36.4*
Percent of base pairs in intergenic DNA                                 74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons            Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons             Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)      Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                   1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.



Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers 
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated 
in 3-Mb windows for each chromosome. NA, not applicable. 



Male 



Chrom. 



Sex-average 



Female 





Max. 


Avg. 


Min. 


Max. 


Avg. 


Min. 


Max. 


Avg. 


Min. 


1 


2.60 


1.12 


0.23 


2.81 


1.42 


0.52 


3.39 


1.76 


0.68 


2 


2.23 


0.78 


0.33 


2.65 


1.12 


0.54 


3.17 


1.40 


0.61 


3 


2.55 


0.86 


0.23 


2.40 


1.07 


0.42 


2.71 


1.30 


0.33 


4 


1.66 


0.67 


0.15 


2.06 


1.04 


0.60 


2.50 


1.40 


0.77 


5 


2.00 


0.67 


0.18 


1.87 


1.08 


0.42 


2.26 


1.43 


0.62 


6 


1.97 


6.71 


0.28 


2.57 


1.12 


0.37 


3.47 


1.67 


0.64 


7 


2.34 


1.16 


0.48 


1.67 


1.17 


0.47 


2.27 


1.21 


0.34 


8 


. 1.83 


0.73 


0.14 


2.40 


1.05 


0.46 


3.44 . 


1.36 


0.43 


9 


2.01 


0.99 


0.53 


1.95 


1.32 


0.77 


2.63 


'1.66 


0.82 


10 


.3.73 


1.03 


0.22 


3.05 


1.29 


. 0.66 


2.84 


1.51 


0.76 


11 


1.43 


0.72 


0.31 


2.13 


0.99 


0.47 


3.10 


1.32 


0.49 


12 


4.12 


0.76 


0.26 


3.35 


1.16 


0.49 


2.93 


1.55 


0.59 


13 


1.60 


0.75 


0.01 


1.87 


0.95 


0.17 


2.49 


1.19 


0.32 


14 


3.15 


0.98 


0.18 


2:65 


1.30 


0.62 


3.14 


1.63 


0.75 


15 


2.28 


0.94 


0.34 


2.31 


1.22 


0.42 


2.53 


1.56 


0.54 


16 


1.83 


1.00 


0.47 


2.70 


1.55 


0.63 


4.99 


2.32 


1.12 


17 


3.87 


0.87 


0.00 


3.54 


1.35 


0.54 


4.19 


1.83 


0.94 


18 


3.12 


1.37 


0.86 


3.75 


1.66 


0.43 


4.35 


2.24 


0.72 


19 


3.02 


0.97 


0.10 


2.57 


1.41 


0.49 


. 2.89 


1.75 


0.87 


20 


3.64 


0.89 


0.00 


2.79 


1.50 


0.83 


3.31 


2.15 • 


1.34 


21 


3.23 


126 


0.69 


2.37 


1.62 


1.08 


2.58 


1.90 


1.18 


22 


1.25 


1.10 


0.84 


1.88 


1.41 


1.08 


3.73 


2.08 


0.93 


X 


NA 


NA 


NA 


NA 


NA 


NA 


3.12 


1.64 


0.72 


Y 


NA 


NA 


NA 


NA 


NA 


NA 


NA 


NA 


NA 


Genome 


4.12 


0.88 


0.00 


3.75 


1.22 


0.17 


4.99 


1.55 


0.32 
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. that account for gene inactivation. The gen- 
eral structural characteristics of these pro- 
cessed pseudogenes include the complete 
lack of intervening sequences found in the. 
functional counterparts, a poly(A) tract at the 
3' end, and direct repeats flanking the pseu- 
dogene sequence. Processed pseudogenes oc- 
cur as a result of retrotransposition, whereas 
unprocessed pseudogenes arise from segmen- 
tal genome duplication. 

We searched the complete set of Otto-predicted transcripts against the
genomic sequence by means of BLAST. Genomic regions corresponding to all
Otto-predicted transcripts were excluded from this analysis.
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
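
The selection rule described above (more than 70% identity over at least 70% of the transcript length) can be sketched as a simple filter over alignment hits. This is a minimal illustration, not the pipeline used in the study: the `Hit` record, its field names, and the function names are hypothetical, and the hits are assumed to have already been parsed from BLAST output with matches to the source gene's own locus removed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One transcript-to-genome match (hypothetical record layout)."""
    transcript_id: str
    transcript_length: int   # length of the query transcript, in bp
    aligned_length: int      # transcript bases covered by the match
    percent_identity: float  # identity over the aligned region, 0 to 100


def is_candidate_pseudogene(hit, min_identity=70.0, min_coverage=0.70):
    """Apply the two thresholds stated in the text."""
    coverage = hit.aligned_length / hit.transcript_length
    return hit.percent_identity > min_identity and coverage >= min_coverage


def filter_hits(hits, min_identity=70.0, min_coverage=0.70):
    """Keep only hits that pass both thresholds."""
    return [h for h in hits
            if is_candidate_pseudogene(h, min_identity, min_coverage)]


example_hits = [
    Hit("t1", 1000, 800, 85.0),  # passes both thresholds
    Hit("t2", 1000, 600, 95.0),  # covers only 60% of the transcript
    Hit("t3", 1000, 900, 65.0),  # identity too low
]
kept = [h.transcript_id for h in filter_hits(example_hits)]
```

With the toy hits shown, only `t1` survives the filter.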

We looked for correlations between structural elements and the
propensity for retrotransposition in the human genome. GC content and
transcript length were compared between the genes with processed
pseudogenes (1177 source genes) versus the remainder of the predicted
gene set. Transcripts that give rise to processed pseudogenes have
shorter average transcript length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no pseudogene was detected. The
overall GC content did not show any significant difference, contrary to
a recent report (88). There is a clear trend in gene families that are
present as processed pseudogenes. These include ribosomal proteins
(67%), lamin receptors (10%), translation elongation factor alpha (5%),
and HMG-non-histone proteins (2%). The increased occurrence of
retrotransposition (both intronless paralogs and processed pseudogenes)
among genes involved in translation and nuclear regulation may reflect
an increased transcriptional activity of these genes.

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22
(34-Mbp sequence length) and the whole genome (2.9-Gbp sequence length)
by means of two different methods. Method 1 uses a CG likelihood ratio
of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                 Chromosome 22           Whole genome
                                                                         (CS assembly)
                                              Method 1   Method 2     Method 1   Method 2
Number of CpG islands detected                   5,211        522      195,706     26,876
Average length of island (bp)                      390        535          395        497
Percent of sequence predicted as CpG               5.9        0.8          2.6        0.4
Percent of first exons that overlap a
  CpG island                                        44         25           42         22
Percent of first exons with first position
  of exon contained inside a CpG island             37         22           40         21
Average distance between first exon and
  closest CpG island (bp)                        1,013     10,486        2,182     17,021
Expected distance between first exon and
  closest CpG island (bp)                        3,262     32,567        7,164     55,811



Table 14. Distribution of repetitive DNA in the compartmentalized
shotgun assembly sequence.

Repetitive elements                          Megabases in   Percent of   Previously
                                             assembled      assembly     predicted
                                             sequences                   (%) (83)
Alu                                               288           9.9         10.0
Mammalian interspersed repeat (MIR)                66           2.3          1.7
Medium reiteration (MER)                           50           1.7          1.6
Long terminal repeat (LTR)                        155           5.3          5.6
Long interspersed nucleotide element (LINE)       466          16.1         16.7
Total                                            1025          35.3         35.6



The complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the idea of searching for blocks of
highly conserved homologous proteins that occur in more than one
location on the genome. For this comparison, two genes were considered
equivalent if their protein products were determined to be in the same
family and the same complete Lek cluster (essentially paralogous genes)
(89). Initially, each chromosome was represented as a string of genes
ordered by the start codons for predicted genes along the chromosome. We
considered the two strands as a single string, because local inversions
are relatively common events relative to large-scale duplications. Each
gene was indexed according to the protein family and Lek complete
cluster (89). All pairs of indexed gene strings were then aligned in
both the forward and reverse directions with the Smith-Waterman
algorithm (90). A match between two proteins of the same Lek complete
cluster was given a score of 10 and a mismatch -10, with gap open and
extend penalties of -4 and -1. With these parameters, 19 conserved
interchromosomal blocks of duplication were observed, all of which were
also detected and expanded by the comprehensive method described below.
The detection of only a relatively small number of block duplications
was a consequence of using an intrinsically conservative method grounded
in the conservative constraints of the complete Lek clusters.
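
As an illustration of the gene-string alignment described above, the following sketch scores a local (Smith-Waterman) alignment between two chromosomes represented as lists of cluster indices, using the stated scores (match 10, mismatch -10, gap open -4, gap extend -1). Several details are assumptions rather than facts from the text: the affine-gap recurrence is the standard Gotoh formulation, the first gap position is charged the open penalty and each further position the extend penalty, and the function names and integer cluster labels are hypothetical.

```python
def local_align_score(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best local alignment score between two sequences of cluster indices."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score ending at (i, j); E, F: best score ending in a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(a, b, **scores):
    """Align in both the forward and the reverse direction, keep the better."""
    return max(local_align_score(a, b, **scores),
               local_align_score(a, b[::-1], **scores))


# Each integer stands for a (hypothetical) Lek complete-cluster index.
chrom_a = [1, 2, 3, 4]
forward_copy = [9, 1, 2, 3, 4, 8]   # same block, same orientation
inverted_copy = [4, 3, 2, 1]        # same block, inverted
gapped_copy = [1, 2, 7, 3, 4]       # same block with one inserted gene
```

Aligning against the reversed string is what lets an inverted block score as highly as a direct one, which is why both directions are tried.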

In the second, more comprehensive approach, we aligned all chromosomes
directly with one another using an algorithm based on the MUMmer system
(91). This alignment method uses a suffix tree data structure and a
linear-time algorithm to align long sequences very rapidly; for example,
two chromosomes of 100 Mbp can be aligned in less than 20 min (on a
Compaq Alpha computer) with 4 gigabytes of memory. This procedure was
used recently to identify numerous large-scale segmental duplications
among the five chromosomes of A. thaliana (92); in that organism, the
method revealed that 60% of the genome (66 Mbp) is covered by 24 very
large duplicated segments. For Arabidopsis, a DNA-based alignment was
sufficient to reveal the segmental duplications between chromosomes; in
the human genome, DNA alignments at the whole-chromosome level are
insufficiently sensitive. Therefore, a modified procedure was developed
and applied, as follows. First, all 26,588 proteins (9,675,713 amino
acids) were concatenated end-to-end in order as they occur along each of
the 24 chromosomes, irrespective of strand location. The concatenated
protein set was then aligned against each chromosome by the MUMmer
algorithm. The resulting matches were clustered to extract all sets of
three or more protein matches that occur in close proximity on two
different chromosomes (95); these represent the candidate segmental
duplications. A series of filters was developed and applied to remove
likely false-positives from this set; for example, small blocks that
were spread across many proteins were removed. To refine the filtering
methods, a shuffled protein set was first created by taking the 26,588
proteins, randomizing their order, and then partitioning them into 24
shuffled chromosomes, each containing the same number of proteins as the
true genome. This shuffled protein set has the identical composition to
the real genome; in particular, every protein and every domain appears
the same number of times. The complete algorithm was then applied to
both the real and the shuffled data, with the results on the shuffled
data being used to estimate the false-positive rate. The algorithm after
filtering yielded 10,310 gene pairs in 1077 duplicated blocks containing
3522 distinct genes; tandemly duplicated expansions in many of the
blocks explain the excess of gene pairs to distinct genes. In the
shuffled data, by contrast, only 370 gene pairs were found, giving a
false-positive estimate of 3.6%. The most likely explanation for the
1077 block duplications is ancient segmental duplications. In many
cases, the order of the proteins has been shuffled, although proximity
is preserved. Out of the 1077 blocks, 159 contain only three genes, 137
contain four genes, and 781 contain five or more genes.
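
The shuffled-genome control described above can be sketched as follows. This is a hedged illustration: the function names and the toy protein identifiers are hypothetical, the random seed is arbitrary, and the false-positive rate is simply taken as the ratio of gene pairs found in the shuffled data to gene pairs found in the real data, which reproduces the 3.6% figure from the counts given in the text.

```python
import random


def shuffle_genome(chromosomes, seed=0):
    """Randomize protein order across the genome, keeping per-chromosome counts.

    `chromosomes` is a list of lists of protein identifiers. The result has
    the same overall composition and the same number of proteins on each
    chromosome, but a random assignment of proteins to positions.
    """
    rng = random.Random(seed)
    pool = [protein for chrom in chromosomes for protein in chrom]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of detected pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real


toy_genome = [["a", "b", "c"], ["d", "e"], ["f"]]
toy_shuffled = shuffle_genome(toy_genome, seed=1)

# Counts reported in the text: 370 pairs in shuffled data, 10,310 in real data.
estimated_rate_percent = round(100 * false_positive_rate(370, 10310), 1)  # 3.6
```

Keeping the per-chromosome protein counts fixed means the control differs from the real genome only in gene order, so any blocks detected in it reflect chance proximity of related proteins.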

To illustrate the extent of the detected duplications, Fig. 13 shows all
1077 block duplications indexed to each chromosome in 24 panels in which
only duplications mapped to the indexed chromosome are displayed. The
figure makes it clear that the duplications are ubiquitous in the
genome. One feature that it displays is many relatively small
chromosomal stretches, with one-to-many duplication relationships that
are graphically striking. One such example captured by the analysis is
the well-documented olfactory receptor (OR) family, which is scattered
in blocks throughout the genome and which has been analyzed for
genome-deployment reconstructions at several evolutionary stages (94).
The figure also illustrates that some chromosomes, such as chromosome 2,
contain many more detected large-scale duplications than others. Indeed,
one of the largest duplicated segments is a large block of 33 proteins
on chromosome 2, spread among eight smaller blocks in 2p, that aligns to
a paralogous set on chromosome 14, with one rearrangement (see
chromosomes 2 and 14 panels in Fig. 13). The proteins are not contiguous
but span a region containing 97 proteins on chromosome 2 and 332
proteins on chromosome 14. The likelihood of observing this many
duplicated proteins by chance, even over a span of this length, is
2.3 × 10^-68 (93). This duplicated set spans 20 Mbp on chromosome 2 and
63 Mbp on chromosome 14, over 70% of the latter chromosome. Chromosome 2
also contains a block duplication that is nearly as large, which is
shared by chromosome arm 2q and chromosome 12. This duplication
incorporates two of the four known Hox gene clusters, but considerably
expands the extent of the duplications proximally and distally on the
pair of chromosome arms. This breadth of duplication is also seen on the
two chromosomes carrying the other two Hox clusters.

An additional large duplication, between chromosomes 18 and 20, serves
as a good example to illustrate some of the features common to many of
the other observed large duplications (Fig. 13, inset). This duplication
contains 64 detected ordered intrachromosomal pairs of homologous genes.
After discounting a 40-Mb stretch of chromosome 18 free of matches to
chromosome 20, which is likely to represent a large insert (between the
gene assignments "Krup rel" and "collagen rel" on chromosome 18 in Fig.
13), the full duplication segment covers 36 Mb on chromosome 18 and 28
Mb on chromosome 20. By this measure, the duplication segment spans
nearly half of each chromosome's net length. The most likely scenario is
that the whole span of this region was duplicated as a single very large
block, followed by shuffling owing to smaller scale rearrangements. As
such, at least four subsequent rearrangements would need to be invoked
to explain the relative insertions and inversions seen in the duplicated
segment interval. The 64 protein pairs in this alignment occur among 217
protein assignments on chromosome 18, and among 322 protein assignments
on chromosome 20, for a density of involved proteins of 20 to 30%. This
is consistent with an ancient large-scale duplication followed by
subsequent gene loss on one or both chromosomes. Loss of just one member
of a gene pair subsequent to the duplication would result in a failure
to score a gene pair in the block; less than 50% gene loss on the
chromosomes would lead to the duplication density observed here. As an
independent verification of the significance of the alignments detected,
it can be seen that a substantial number of the pairs of aligning
proteins in this duplication, including some of those annotated (Fig.
13), are those populating small Lek complete clusters (see above). This
indicates that they are members of very small families of paralogs;
their relative scarcity within the genome validates the uniqueness and
robust nature of their alignments.

Fig. 12. Gene duplication in complete protein clusters. The predicted
protein sets of human, worm, and fly were subjected to Lek clustering
(27). The numbers of clusters with varying ratios (whole number) of
human versus worm and human versus fly proteins per cluster were
plotted.

Two additional qualitative features were observed among many of the
large-scale duplications. First, several proteins with disease
associations, with OMIM (Online Mendelian Inheritance in Man)
assignments, are members of duplicated segments (see web table 2 on
Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also observed a few instances where paralogs on both duplicated
segments are associated with similar disease conditions. Notable among
these genes are proteins involved in hemostasis (coagulation factors)
that are associated with bleeding disorders, transcriptional regulators
like the homeobox proteins associated with developmental disorders, and
potassium channels associated with cardiovascular conduction
abnormalities. For each of these disease genes, closer study of the
paralogous genes in the duplicated segment may reveal new insights into
disease causation, with further investigation needed to determine
whether they might be involved in the same or similar genetic diseases.
Second, although there is a conserved number of proteins and coding
exons predicted for specific large duplicated spans within the
chromosome 18 to 20 alignment, the genomic DNA of chromosome 18 in these
specific spans is in some cases more than 10-fold longer than the
corresponding chromosome 20 DNA. This selective accretion of noncoding
DNA (or conversely, loss of noncoding DNA) on one of a pair of
duplicated chromosome regions was observed in many compared regions.
Hypotheses to explain which mechanisms foster these processes must be
tested.

Evaluation of the alignment results gives some perspective on dating of
the duplications. As noted above, large-scale ancient segmental
duplication in fact best explains many of the blocks detected by this
genome-wide analysis. The regions of human chromosomes involved in the
large-scale duplications expanded upon above (chromosomes 2 to 14, 2 to
12, and 18 to 20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse chromosomal regions are much more
similar in sequence conservation, and even in order, to their human
synteny partners than the human duplication regions are to each other.
Further, the corresponding mouse chromosomal regions each bear a
significant proportion of genes orthologous to the human genes on which
the human duplication assignments were made. On the basis of these
factors, the corresponding mouse chromosomal spans, at coarse
resolution, appear to be products of the same large-scale duplications
observed in humans. Although further detailed analysis must be carried
out once a more complete genome is assembled for mouse, the underlying
large duplications appear to predate the two species' divergence. This
dates the duplications, at the latest, before divergence of the primate
and rodent lineages. This date can be further refined upon examination
of the synteny between human chromosomes and those of chicken,
pufferfish (Fugu rubripes), or zebrafish (95). The only substantial
syntenic stretches mapped in these species corresponding to both pairs
of human duplications are restricted to the Hox cluster regions. When
the synteny of these regions (or others) to human chromosomes is
extended with further mapping, the ages of the nearly chromosome-length
duplications seen in humans are likely to be dated to the root of
vertebrate divergence.

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hypotheses is more likely.
Comparisons of genomes of different vertebrates, and even cross-phyla
genome comparisons, will allow for the deconvolution of duplications to
eventually reveal the stagewise history of our genome, and with it a
history of the emergence of many of the key functions that distinguish
us from other living things.

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used to identify single-nucleotide
polymorphisms (SNPs) by comparison of the Celera sequence to other SNP
resources. The SNP rate between two chromosomes was ~1 per 1200 to 1500
bp. SNPs are distributed nonrandomly throughout the genome. Only a very
small proportion of all SNPs (<1%) potentially impact protein function
based on the functional analysis of SNPs that affect the predicted
coding regions. This results in an estimate that only thousands, not
millions, of genetic variations may contribute to the structural
diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a
dramatic acceleration in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we discover the genetic basis
for variation in health among human beings. Whole-genome shotgun
sequencing is a particularly effective method for detecting sequence
variation in tandem with whole-genome assembly. In addition, we compared
the distribution and attributes of SNPs ascertained by three other
methods: (i) alignment of the Celera consensus sequence to the PFP
assembly, (ii) overlap of high-quality reads of genomic sequence
(referred to as "Kwok"; 1,120,195 SNPs) (97), and (iii) reduced
representation shotgun sequencing (referred to as "TSC"; 632,640 SNPs)
(98). These data were consistent in showing an overall nucleotide
diversity of ~8 × 10^-4, marked heterogeneity across the genome in SNP
density, and an overwhelming preponderance of noncoding variation that
produces no change in expressed proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full use of sequence depth and
quality at every site, and quantitatively control the rate of
false-positive and false-negative calls with an explicit sampling model
(99). Comparison of consensus sequences in the absence of these details
necessitated a more ad hoc approach (quality scores could not readily be
obtained for the PFP assembly). First, all sequence differences between
the two consensus sequences were identified; these were then filtered to
reduce the contribution of sequencing errors and misassembly. As a
measure of the effectiveness of the filtering step, we monitored the
ratio of transition and transversion substitutions, because a 2:1 ratio
has been well documented as typical in mammalian evolution (100) and in
human SNPs (101, 102). The filtering steps consisted of removing
variants where the quality score in the Celera consensus was less than
30 and where the density of variants was greater than 5 in 400 bp. These
filters resulted in shifting the transition-to-transversion ratio from
1.57:1 to 1.89:1. When applied to 2.3 Gbp of alignments between the
Celera and PFP consensus sequences, these filters resulted in
identification of 2,104,820 putative SNPs from a total of 2,778,474
substitution differences. Overlaps between this set of SNPs and those
found by other methods are described below.
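
The two filters and the transition-to-transversion check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the variant record layout (`pos`, `ref`, `alt`, `quality`) and the function names are hypothetical, all variants are assumed to lie on one sequence, and the density rule is interpreted as dropping every variant that falls in some 400-bp window holding more than 5 candidate variants, counted before the quality filter is applied.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(variants):
    """Ratio of transitions (purine<->purine, pyrimidine<->pyrimidine)
    to transversions among the given substitutions."""
    ts = sum(1 for v in variants if (v["ref"], v["alt"]) in TRANSITIONS)
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")


def dense_positions(positions, max_in_window=5, window=400):
    """Positions lying in some window of `window` bp that holds more than
    `max_in_window` variants."""
    positions = sorted(positions)
    dense = set()
    j = 0
    for i in range(len(positions)):
        j = max(j, i)
        # Extend j to the last variant within `window` bp of variant i.
        while j + 1 < len(positions) and positions[j + 1] - positions[i] < window:
            j += 1
        if j - i + 1 > max_in_window:
            dense.update(positions[i:j + 1])
    return dense


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Drop low-quality variants and variants in over-dense windows."""
    dense = dense_positions([v["pos"] for v in variants], max_in_window, window)
    return [v for v in variants
            if v["quality"] >= min_quality and v["pos"] not in dense]


# Toy input: six tightly clustered differences plus four isolated ones.
cluster = [{"pos": 100 + 10 * k, "ref": "A", "alt": "C", "quality": 40}
           for k in range(6)]
isolated = [
    {"pos": 1000, "ref": "A", "alt": "G", "quality": 40},  # transition
    {"pos": 2000, "ref": "C", "alt": "A", "quality": 40},  # transversion
    {"pos": 3000, "ref": "C", "alt": "T", "quality": 20},  # low quality
    {"pos": 4000, "ref": "G", "alt": "A", "quality": 35},  # transition
]
kept = filter_variants(cluster + isolated)
```

In this toy input the cluster is dominated by transversions, so removing it moves the ratio toward 2:1, the same direction of change that the text reports for the real data.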

6.2 Comparisons to public SNP
databases

Additional SNPs, including 2,536,021 from dbSNP
(www.ncbi.nlm.nih.gov/SNP) and 13,150 from HGMD (Human Gene Mutation
Database, from the University of Wales, UK), were mapped on the Celera
consensus sequence by a sequence similarity search with the program
PowerBlast (103). The two largest data sets in dbSNP are the Kwok and
TSC sets, with 47% and 25% of the dbSNP records. Low-quality alignments
with partial coverage of the dbSNP sequence and alignments that had less
than 98% sequence identity between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP sequences mapping to multiple
locations on the Celera genome were discarded. A total of 2,336,935
dbSNP variants were mapped to 1,223,038 unique locations on the Celera
sequence, implying considerable redundancy in dbSNP. SNPs in the TSC set
mapped to 585,811 unique genomic locations, and SNPs in the Kwok set
mapped to 438,032 unique locations. The combined unique SNP count used
in this analysis, including Celera-PFP, TSC, and Kwok, is 2,737,668.
Table 15 shows that a substantial fraction of SNPs identified by one of
these methods was also found by another method. The very high overlap
(36.2%) between the Kwok and Celera-PFP SNPs may be due in part to the
use by Kwok of sequences that went into the PFP assembly. The unusually
low overlap (16.4%) between the Kwok and TSC sets is due to their being
the smallest two sets. In addition, 24.5% of the Celera-PFP SNPs overlap
with SNPs derived from the Celera genome sequences (46). SNP validation
in population samples is an expensive and laborious process, so
confirmation on multiple data sets may provide an efficient initial
validation "in silico" (by computational analysis).

Table 15. Overlap of SNPs from genome-wide SNP databases. Table entries
are SNP counts for each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of overlapping SNPs
divided by the number of SNPs in the smaller of the two databases
compared. Total SNP counts for the databases are: Celera-PFP, 2,104,820;
TSC, 585,811; and Kwok, 438,032. Only unique SNPs in the TSC and Kwok
data sets were included.
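
The overlap fraction defined for Table 15 (shared SNPs divided by the size of the smaller data set) can be sketched as below. The function name and the toy data are hypothetical; each SNP is assumed to be identified by its mapped (chromosome, position) location on a common reference, so that two records count as the same SNP when their locations coincide.

```python
def overlap_fraction(set_a, set_b):
    """Shared SNP locations divided by the size of the smaller data set."""
    shared = len(set_a & set_b)
    return shared / min(len(set_a), len(set_b))


# Toy data sets keyed by (chromosome, position).
small_set = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900)}
large_set = {("chr1", 250), ("chr2", 900), ("chr3", 5), ("chr3", 77),
             ("chr4", 12), ("chr5", 8)}
toy_overlap = overlap_fraction(small_set, large_set)  # 2 shared / 4 = 0.5
```

Dividing by the smaller set gives the largest overlap the pair could possibly show a value of 1.0, which keeps comparisons between sets of very different size on the same scale.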

One means of assessing whether the three sets of SNPs provide the same
picture of human variation is to tally the frequencies of the six
possible base changes in each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly derived from small-scale analysis on
candidate genes (101), and our analysis with all three data sets
validates the previous observations at the whole-genome scale. There is
remarkable homogeneity between the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46) in this substitution pattern.
Compared with the rest of the data sets, Celera-PFP deviates slightly
from the 2:1 transition-to-transversion ratio observed in the other SNP
sets. This result is not unexpected, because some fraction of the
computationally identified SNPs in the Celera-PFP comparison may in fact
be sequence errors. A 2:1 transition:transversion ratio for the bona
fide SNPs would be obtained if one assumed that 15% of the sequence
differences in the Celera-PFP set were a result of (presumably random)
sequence errors.

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied widely across chromosomes. In order
to normalize these values to the chromosome size and sequence coverage,
we used π, the standard statistic for nucleotide diversity (104).
Nucleotide diversity is a measure of per-site heterozygosity,
quantifying the probability that a pair of chromosomes drawn from the
population will differ at a nucleotide site. In order to calculate
nucleotide diversity for each chromosome, we need to know the number of
nucleotide sites that were surveyed for variation, and in methods like
reduced representation sequencing, we need to know the sequence quality
and the depth of coverage at each site. These data are not readily
available, so we could not estimate nucleotide diversity from the TSC
effort. Estimation of nucleotide diversity from high-quality sequence
overlaps should be possible, but again more information is needed on the
details of all the alignments.
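
For intuition, the definition of π given above (the probability that two chromosomes differ at a site) can be computed directly in the idealized case where several fully sequenced, aligned copies of a region are available. This sketch is only that idealized case: it assumes every site was surveyed in every sequence with no errors, which is exactly the information the text notes is missing for the TSC and overlap-based data. The function name and toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average per-site difference over all pairs of aligned sequences.

    All sequences must have the same length (one aligned region).
    """
    n_sites = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(1 for x, y in zip(s, t) if x != y) for s, t in pairs
    )
    return differences / (len(pairs) * n_sites)


# Three aligned copies of a 4-bp region; one copy differs at the last site.
toy_pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])  # 2 / (3 * 4)
```

On real shotgun data the denominator has to be replaced by the number of sites actually surveyed at adequate depth and quality, which is why coverage information is needed.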

Estimation of nucleotide diversity from a shotgun assembly entails
calculating, for each column of the multialignment, the probability that
two or more distinct alleles are present and the probability of
detecting a SNP if in fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The greater the depth of
coverage and the higher the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even after correcting for
variation in coverage, the nucleotide diversity appeared to vary across
autosomes. The significance of this heterogeneity was tested by analysis
of variance, with estimates of π for 100-kbp windows to estimate
variability within chromosomes (for the Celera-PFP comparison,
F = 29.73, P < 0.0001).

Average diversity for the autosomes estimated from the Celera-PFP
comparison was 8.94 × 10^-4. Nucleotide diversity on the X chromosome
was 6.54 × 10^-4. The X is expected to be less variable than autosomes,
because for every four copies of autosomes in the population, there are
only three X chromosomes, and this smaller effective population size
means that random drift will more rapidly remove variation from the X
(106).
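
The three-to-four argument above implies a simple expectation that can be checked against the reported values. The sketch below is a back-of-the-envelope comparison only; it assumes, as a simplification not stated in the text, that diversity scales in direct proportion to the number of chromosome copies in the population, and the variable and function names are hypothetical.

```python
def diversity_ratio(x_diversity, autosome_diversity):
    """Observed X-to-autosome ratio of nucleotide diversity."""
    return x_diversity / autosome_diversity


# Values reported in the text for the Celera-PFP comparison.
AUTOSOME_PI = 8.94e-4
X_PI = 6.54e-4

expected_ratio = 3 / 4  # three X copies for every four autosome copies
observed_ratio = diversity_ratio(X_PI, AUTOSOME_PI)  # about 0.73
```

The observed ratio of about 0.73 is close to the 0.75 expected from copy number alone.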

Having ascertained nucleotide variation genome-wide, it appears that
previous estimates of nucleotide diversity in humans based on samples of
genes were reasonably accurate (101, 102, 106, 107). Genome-wide, our
estimate of nucleotide diversity was 8.98 × 10^-4 for the Celera-PFP
alignment, and a published estimate averaged over 10 densely resequenced
human genes was 8.00 × 10^-4 (108).



6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variability among chromosomes in SNP density raises the question of whether there is heterogeneity at a finer scale within chromosomes, and whether this heterogeneity is greater than expected by chance. If SNPs occur by random and independent mutations, then there ought to be a Poisson distribution of numbers of SNPs in fragments of arbitrary constant size. The observed dispersion in the distribution of SNPs in 100-kbp fragments was far greater than predicted from a Poisson distribution (Fig. 14). However, this simplistic model ignores the different recombination rates and population histories that exist in different regions of the genome. Population genetics theory holds that we can account for this variation with a mathematical formulation called the neutral coalescent (109). Applying well-tested algorithms for simulating the neutral coalescent with recombination (110), and using an effective population size of 10,000 and a per-base recombination rate equal to the mutation rate (111), we generated a distribution of numbers of SNPs by this model as well (112). The observed distribution of SNPs has a much larger variance than either the Poisson model or the coalescent model, and the difference is highly significant. This implies that there is significant variability across the genome in SNP density, an observation that calls for an explanation.

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line represents a pair of homologous genes belonging to a block; all blocks contain at least three genes on each of the chromosomes where they appear. Each panel shows all the duplications between a single chromosome and other chromosomes with shared blocks. The chromosome at the center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed from top to bottom within each panel ordered by chromosome number. The inset (bottom, center right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display the gene names of 12 of the 64 gene pairs shown.
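
One simple way to see the overdispersion described in Section 6.4 is the variance-to-mean ratio of SNP counts per window, which is close to 1 when counts follow a Poisson distribution. The sketch below is only an illustration of that idea: the function name `dispersion_index` and the window counts are hypothetical, and it is not the coalescent simulation used in the study.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio of per-window counts (about 1 under a Poisson model)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; index is undefined")
    return variance(counts) / m


# Hypothetical SNP counts in eight equal-sized windows (not data from the paper).
window_counts = [50, 120, 80, 10, 200, 95, 60, 140]
index = dispersion_index(window_counts)
print(f"dispersion index = {index:.1f}")
```

A ratio well above 1, as in this toy example, indicates more window-to-window variation than independent random mutation alone would produce.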

Several attributes of the DNA sequence may affect the local density of SNPs, including the rate at which DNA polymerase makes errors and the efficacy of mismatch repair. One key factor that is likely to be associated with SNP density is the G+C content, in part because methylated cytosines in CpG dinucleotides tend to undergo deamination to form thymine, accounting for a nearly 10-fold increase in the mutation rate of CpGs over other dinucleotides. We tallied the G+C content and nucleotide diversities in 100-kbp windows across the entire genome and found that the correlation between them was positive (r = 0.21) and highly significant (P < 0.0001), but G+C content accounted for only a small part of the variation.
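
The statement that G+C content explains only a small part of the variation follows from squaring the correlation coefficient. The sketch below shows a plain Pearson correlation over per-window values and the variance explained by the reported r = 0.21; the helper name `pearson_r` is illustrative, and no per-window data from the study are reproduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Correlation reported in the text between G+C content and diversity.
reported_r = 0.21
variance_explained = reported_r ** 2
print(f"fraction of variance explained = {variance_explained:.3f}")
```

With r = 0.21, the squared correlation is about 0.04, so roughly 4% of the window-to-window variance in diversity is associated with G+C content.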

6.5 SNPs by genomic class 

To test homogeneity of SNP densities across functional classes, we partitioned sites into intergenic (defined as >5 kbp from any predicted transcription unit), 5'-UTR, exonic (missense and silent), intronic, and 3'-UTR for 10,239 known genes, derived from the NCBI RefSeq database and all human genes predicted from the Celera Otto annotation. In coding regions, SNPs were categorized as either silent, for those that do not change amino acid sequence, or missense, for those that change the protein product. The ratio of missense to silent coding SNPs in Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78, respectively) shows a markedly reduced frequency of missense variants compared with the neutral expectation, consistent with the elimination by natural selection of a fraction of the deleterious amino acid changes (112). These ratios are comparable to the missense-to-silent ratios of 0.88 and 1.17 found by Cargill et al. (101) and by Halushka et al. (102). Similar results were observed in SNPs derived from Celera shotgun sequences (46).
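
The silent/missense split described above can be sketched with the standard genetic code. The function `classify_coding_snp` and its arguments are hypothetical names introduced for illustration. Following the two-way definition in the text, any change in the encoded product (including a gained or lost stop codon) is labeled missense here, which is a simplification.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(codon, position, alt_base):
    """Label a single-base change in a codon as 'silent' or 'missense'."""
    if alt_base == codon[position]:
        raise ValueError("alternate base equals reference base")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    same_product = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "silent" if same_product else "missense"


# GAA (Glu) -> GAG (Glu) leaves the protein unchanged; GAA -> AAA gives Lys.
third_position = classify_coding_snp("GAA", 2, "G")
first_position = classify_coding_snp("GAA", 0, "A")
print(third_position, first_position)
```

Counting the two labels over a SNP set and dividing gives a missense-to-silent ratio of the kind quoted in the text.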

It is striking how small the fraction of SNPs is that lead to potentially dysfunctional alterations in proteins. In the 10,239 RefSeq genes, missense SNPs were only about 0.12, 0.14, and 0.17% of the total SNP counts in Celera-PFP, TSC, and Kwok SNPs, respectively. Nonconservative protein changes constitute an even smaller fraction of missense SNPs (47, 41, and 40% in Celera-PFP, Kwok, and TSC). Intergenic regions have been virtually unstudied (113), and we note that 75% of the SNPs we identified were intergenic (Table 17). The SNP rate was highest in introns and lowest in exons. The SNP rate was lower in intergenic regions than in introns, providing one of the first discriminators between these two classes of DNA. These SNP rates were confirmed in the Celera SNPs, which also exhibited a lower rate in exons than in introns, and in extragenic regions than in introns (46). Many of these intergenic SNPs will provide valuable information in the form of markers for linkage and association studies, and some fraction is likely to have a regulatory function as well.

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely accounted for by a coalescent model of regional history.


7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial computational analysis of the predicted protein set with the aim of cataloging prominent differences and similarities when the human genome is compared with other fully sequenced eukaryotic genomes. Over 40% of the predicted protein set in humans cannot be ascribed a molecular function by methods that assign proteins to known families. A protein domain-based analysis provides a detailed catalog of the prominent differences in the human genome when compared with the fly and worm genomes. Prominent among these are domain expansions in proteins involved in developmental regulation and in cellular processes such as neuronal function, hemostasis, acquired immune response, and cytoskeletal complexity. The final enumeration of protein families and details of protein structure will rely on additional experimental work and comprehensive manual curation.

A preliminary analysis of the predicted human protein-coding genes was conducted. Two methods were used to analyze and classify the molecular functions of 26,588 predicted proteins that represent 26,383 gene predictions with at least two lines of evidence as described above. The first method was based on an analysis at the level of protein families, with both the publicly available Pfam database (114, 115) and Celera's Panther Classification (CPC) (Fig. 15) (116). The second method was based on an analysis at the level of protein domains, with both the Pfam and SMART databases (115, 117).

The results presented here are preliminary and are subject to several limitations. Both the gene predictions and functional assignments have been made by using computational tools, although the statistical models in Panther, Pfam, and SMART have been built, annotated, and reviewed by expert biologists. In the set of computationally predicted genes, we expect both false-positive predictions (some of these may in fact be inactive pseudogenes) and false-negative predictions (some human genes will not be computationally predicted). We also expect errors in delimiting the boundaries of exons and genes. Similarly, in the automatic functional assignments, we also expect both false-positive and false-negative predictions. The functional assignment protocol focuses on protein families that tend to be found across several organisms, or on families of known human genes. Therefore, we do not assign a function to many genes that are not in large families, even if the function is known. Unless otherwise specified, all enumeration of the genes in any given family or functional category was taken from the set of 26,588 predicted proteins, which were assigned functions by using statistical score cutoffs defined for models in Panther, Pfam, and SMART.

For this initial examination of the predicted human protein set, three broad questions were asked: (i) What are the likely molecular functions of the predicted gene products, and how are these proteins categorized with current classification methods? (ii) What are the core functions that appear to be common across the animals? (iii) How does the human protein complement differ from that of other sequenced eukaryotes?

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the putative molecular functions of the predicted 26,588 human proteins that have at least two lines of supporting evidence. About 41% (12,809) of the gene products could not be classified from this initial analysis and are termed proteins with unknown functions. Because our automatic classification methods treat only relatively large protein families, there are a number of "unclassified" sequences that do, in fact, have a known or predicted function. For the 60% of the protein set that have automatic functional predictions, the specific protein functions have been placed into broad classes. We focus here on molecular function (rather than higher order cellular processes) in order to classify as many proteins as possible. These functional predictions are based on similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted genes (those with only one piece of supporting evidence), only 636 (5%) of these additional putative genes were assigned molecular functions by the automated methods. One-third of these 636 predicted genes represented endogenous retroviral proteins, further suggesting that the majority of these unknown-function genes are not real genes. Given that most of these additional 12,095 genes appear to be unique among the genomes sequenced to date, many may simply represent false-positive gene predictions.

The most common molecular functions are the transcription factors and those involved in nucleic acid metabolism (nucleic acid enzyme). Other functions that are highly represented in the human genome are the receptors, kinases, and hydrolases. Not surprisingly, most of the hydrolases are proteases. There are also many proteins that are members of proto-oncogene families, as well as families of "select regulatory molecules": (i) proteins involved in specific steps of signal transduction such as heterotrimeric GTP-binding proteins (G proteins) and cell cycle regulators, and (ii) proteins that modulate the activity of kinases, G proteins, and phosphatases.



Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class     Size of region examined (Mb)   Celera-PFP SNP density (SNP/Mb)
Intergenic               2185                           707
Gene (intron + exon)     646                            917
Intron                   615                            921
First intron             164                            808
Exon                     31                             529
First exon               10                             592
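
The entries of Table 17 can be turned into approximate SNP counts by multiplying each region's size by its SNP density. The sketch below does this for the two top-level classes and computes the intergenic share. It is a back-of-envelope check under the assumption that the rounded table values are representative, so it need not reproduce the 75% figure quoted in the text exactly.

```python
# (size in Mb, SNP density in SNPs per Mb) taken from Table 17.
TABLE_17 = {
    "Intergenic": (2185, 707),
    "Gene (intron + exon)": (646, 917),
    "Intron": (615, 921),
    "First intron": (164, 808),
    "Exon": (31, 529),
    "First exon": (10, 592),
}


def approx_snp_count(region):
    """Approximate number of SNPs in a region class: size times density."""
    size_mb, density = TABLE_17[region]
    return size_mb * density


intergenic = approx_snp_count("Intergenic")
genic = approx_snp_count("Gene (intron + exon)")
intergenic_share = intergenic / (intergenic + genic)
print(f"approximate intergenic share of SNPs = {intergenic_share:.1%}")
```

Only the intergenic and gene rows are summed, because the intron and exon rows are subsets of the gene row.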
Fig. 15. Distribution of the molecular functions of 26,383 human genes. Each slice lists the numbers and percentages (in parentheses) of human gene functions assigned to a given category of molecular function. The outer circle shows the assignment to molecular function categories in the Gene Ontology (GO) (179), and the inner circle shows the assignment to Celera's Panther molecular function categories (116).

Slice labels:
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
chaperone (159, 0.5%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)
7.2 Evolutionary conservation of core
processes

Because of the various "model organism" genome-sequencing projects that have already been completed, reasonable comparative information is available for beginning the analysis of the evolution of the human genome. The genomes of S. cerevisiae ("bakers' yeast") (118) and two diverse invertebrates, C. elegans (a nematode worm) (119) and D. melanogaster (fly) (26), as well as the first plant genome, A. thaliana, recently completed (92), provide a diverse background for genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly, and between human and worm (Fig. 16) to address the question, What are the core functions that appear to be common across the animals? The concept of orthology is important because if two genes are orthologs, they can be traced by descent to the common ancestor of the two organisms (an "evolutionarily conserved protein set"), and therefore are likely to perform similar conserved functions in the different organisms. It is critical in this analysis to separate orthologs (a gene that appears in two organisms by descent from a common ancestor) from paralogs (a gene that appears in more than one copy in a given organism by a duplication event) because paralogs may subsequently diverge in function. Following the yeast-worm ortholog comparison in (120), we identified two different cases for each pairwise comparison (human-fly and human-worm). The first case was a pair of genes, one from each organism, for which there was no other close homolog in either organism. These are straightforwardly identified as orthologous, because there are no additional members of the families that complicate separating orthologs from paralogs. The second case is a family of genes with more than one member in either or both of the organisms being compared. Chervitz et al. (120) dealt with this case by analyzing a phylogenetic tree that described the relationships between all of the sequences in both organisms, and then looking for pairs of genes that were nearest neighbors in the tree. If the nearest-neighbor pairs were from different organisms, those genes were presumed to be orthologs. We note that these nearest neighbors can often be confidently identified from pairwise sequence comparison without having to examine a phylogenetic tree (see legend to Fig. 16). If the nearest neighbors are not from different organisms, there has been a paralogous expansion in one or both organisms after the speciation event (and/or a gene loss by one organism). When this one-to-one correspondence is lost, defining an ortholog becomes ambiguous. For our initial computational overview of the predicted human protein set, we could not answer this question for every predicted protein. Therefore, we consider only "strict orthologs," i.e., the proteins with unambiguous one-to-one relationships (Fig. 16). By these criteria, there are 2758 strict human-fly orthologs and 2031 strict human-worm orthologs (1523 in common between these sets). We define the evolutionarily conserved set as those 1523 human proteins that have strict orthologs in both D. melanogaster and C. elegans.
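
The nearest-neighbor idea described above is often approximated by reciprocal (bi-directional) best hits. The sketch below is a simplified illustration under stated assumptions: the gene identifiers and scores are hypothetical, the inputs are plain dictionaries rather than real BLAST output, and it omits the P-value cutoff and the paralog-score comparison that the Fig. 16 legend also requires.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical scores: hB and hC both hit fB, but fB's best hit is hC.
human_vs_fly = {("hA", "fA"): 300, ("hA", "fB"): 120,
                ("hB", "fB"): 250, ("hC", "fB"): 260}
fly_vs_human = {("fA", "hA"): 310, ("fB", "hC"): 255, ("fB", "hB"): 240}

orthologs = reciprocal_best_hits(human_vs_fly, fly_vs_human)
print(sorted(orthologs))
```

Intersecting the human members of the human-fly and human-worm pair sets then gives a conserved set of the kind counted in the text.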

The distribution of the functions of the conserved protein set is shown in Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the set of conserved proteins is not distributed among molecular functions in the same way as the whole human protein set. Compared with the whole human set (Fig. 15), there are several categories that are overrepresented in the conserved set by a factor of ~2 or more. The first category is nucleic acid enzymes, primarily the transcriptional machinery (notably DNA/RNA methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing factors, nucleases, and ribosomal proteins). The basic transcriptional and translational machinery is well known to have been conserved over evolution, from bacteria through to the most complex eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to be conserved among the animals. Other enzyme types are also overrepresented (transferases, oxidoreductases, ligases, lyases, and isomerases). Many of these enzymes are involved in intermediary metabolism. The only exception is the hydrolase category, which is not significantly overrepresented in the shared protein set. Proteases form the largest part of this category, and several large protease families have expanded in each of these three organisms after their divergence. The category of select regulatory molecules is also overrepresented in the conserved set. The major conserved families are small guanosine triphosphatases (GTPases) (especially the Ras-related superfamily, including ADP ribosylation factor) and cell cycle regulators (particularly the cullin family, cyclin C family, and several cell division protein kinases). The last two significantly overrepresented categories are protein transport and trafficking, and chaperones. The most conserved groups in these categories are proteins involved in coated vesicle-mediated transport, and chaperones involved in protein folding and heat-shock response [particularly the DNAJ family, and heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These observations provide only a conservative estimate of the protein families in the context of specific cellular processes that were likely derived from the last common ancestor of the human, fly, and worm. As stated before, this analysis does not provide a complete estimate of conservation across the three animal genomes, as paralogous duplication makes the determination of true orthologs difficult within the members of conserved protein families.

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate genomes. Each slice lists the number and percentages (in parentheses) of "strict orthologs" between the human, fly, and worm genomes involved in a given category of molecular function. "Strict orthologs" are defined here as bi-directional BLAST best hits (180) such that each orthologous pair (i) has a BLASTP P-value of ≤10^-10 (120), and (ii) has a more significant BLASTP score than any paralogs in either organism, i.e., there has likely been no duplication subsequent to speciation that might make the orthology ambiguous. This measure is quite strict and is a lower bound on the number of orthologs. By these criteria, there are 2758 strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common between these sets).

7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 
To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated with phenotypic differences to begin to reveal the developmental and cellular processes that are unique to the vertebrates. Tables 18 and 19 display a comparison among all sequenced eukaryotic genomes, over selected protein/domain families (defined by sequence similarity, e.g., the serine-threonine protein kinases) and superfamilies (defined by shared molecular function, which may include several sequence-related families, e.g., the cytokines). In these tables we have focused on (super)families that are either very large or that differ significantly in humans compared with the other sequenced eukaryote genomes. We have found that the most prominent human expansions are in proteins involved in (i) acquired immune functions; (ii) neural development, structure, and functions; (iii) intercellular and intracellular signaling pathways in development and homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human genome and the Drosophila or C. elegans genome is the appearance of genes involved in acquired immunity (Tables 18 and 19). This is expected, because the acquired immune response is a defense system that only occurs in vertebrates. We observe 22 class I and 22 class II major histocompatibility complex (MHC) antigen genes and 114 other immunoglobulin genes in the human genome. In addition, there are 59 genes in the cognate immunoglobulin receptor family. At the domain level, this is exemplified by an expansion and recruitment of the ancient immunoglobulin fold to constitute molecules such as MHC, and of the integrin fold to form several of the cell adhesion molecules that mediate interactions between immune effector cells and the extracellular matrix. Vertebrate-specific proteins include the paracrine immune regulators family of secreted 4-alpha helical bundle proteins, namely the cytokines and chemokines. Some of the cytoplasmic signal transduction components associated with cytokine receptor signal transduction are also features that are poorly represented in the fly and worm. These include protein domains found in the signal transducers and activators of transcription (STATs), the suppressors of cytokine signaling (SOCS), and protein inhibitors of activated STATs (PIAS). In contrast, many of the animal-specific protein domains that play a role in innate immune response, such as the Toll receptors, do not appear to be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as compared with the worm and fly genomes, there is a marked increase in the number of members of protein families that are involved in neural development. Examples include neurotrophic factors such as ependymin and nerve growth factor, and signaling molecules such as semaphorins. There is also an increase in the number of proteins involved directly in neural structure and function, such as myelin proteins, voltage-gated ion channels, and synaptic proteins such as synaptotagmin. These observations correlate well with the known phenotypic differences between the nervous systems of these taxa, notably (i) the increase in the number and connectivity of neurons; (ii) the increase in number of distinct neural cell types (as many as a thousand or more in human compared with a few hundred in fly and worm) (121); (iii) the increased length of individual axons; and (iv) the significant increase in glial cell number, especially the appearance of myelinating glial cells, which are electrically inert supporting cells differentiated from the same stem cells as neurons. A number of prominent protein expansions are involved in the processes of neural development. Of the extracellular domains that mediate cell adhesion, the connexin domain-containing proteins (122) exist only in humans. These proteins, which are not present in the Drosophila or C. elegans genomes, appear to provide the constitutive subunits of intercellular channels and the structural basis for electrical coupling. Pathway finding by axons and neuronal network formation is mediated through a subset of ephrins and their cognate receptor tyrosine kinases that act as positional labels to establish topographical projections (123). The probable biological role for the semaphorins (22 in human compared with 6 in the fly and 2 in the worm) and their receptors (neuropilins and plexins) is that of axonal guidance molecules (124). Signaling molecules such as neurotrophic factors and some cytokines have been shown to regulate neuronal cell survival, proliferation, and axon guidance (125). Notch receptors and ligands play important roles in glial cell fate determination and gliogenesis (126).

Other human expanded gene families play key roles directly in neural structure and function. One example is synaptotagmin (expanded more than twofold in humans relative to the invertebrates), originally found to regulate synaptic transmission by serving as a Ca2+ sensor (or receptor) during synaptic vesicle fusion and release (127). Of interest is the increased co-occurrence in humans of PDZ and the SH3 domains in neuronal-specific adaptor molecules; examples include proteins that likely modulate channel activity at synaptic junctions (128). We also noted expansions in several ion-channel families (Table 19), including the EAG subfamily (related to cyclic nucleotide gated channels), the voltage-gated calcium/sodium channel family, the inward-rectifier potassium channel family, and the voltage-gated potassium channel, alpha subunit family. Voltage-gated sodium and potassium channels are involved in the generation of action potentials in neurons. Together with voltage-gated calcium channels, they also play a key role in coupling action potentials to neurotransmitter release, in the development of neurites, and in short-term memory. The recent observation of a calcium-regulated association between sodium channels and synaptotagmin may have consequences for the establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes of protein components in both the central and peripheral nervous system of vertebrates. Myelin P0 is a major component of peripheral myelin, and myelin proteolipid and myelin oligodendrocyte glycoprotein are found in the central nervous system. Mutations in any of these
Table 18. Domain-based comparative analysis of proteins in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The predicted protein set of each of the above eukaryotic organisms was analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Domains were categorized into cellular processes for presentation. Some domains (i.e., SH2) are listed in more than one cellular process. Results of the Pfam analysis may differ from results obtained based on human curation of protein families, owing to the limitations of large-scale automatic classifications. Representative examples of domains with reduced counts owing to the stringent E value cutoff used for this analysis are marked with a double asterisk (**). Examples include short divergent and predominantly alpha-helical domains, and certain classes of cysteine-rich zinc finger proteins.

Accession number   Domain name       Domain description

Developmental and homeostatic
PF02039            Adrenomedullin    Adrenomedullin
PF00212            ANP               Atrial natriuretic peptide
PF00028            Cadherin          Cadherin domain
PF00214            Calc_CGRP_IAPP    Calcitonin/CGRP/IAPP family
PF01110            CNTF              Ciliary neurotrophic factor
PF01093            Clusterin         Clusterin
PF00029            Connexin          Connexin
PF00976            ACTH_domain       Corticotropin ACTH domain
PF00473            CRF               Corticotropin-releasing factor family
PF00007            Cys_knot          Cystine-knot domain
PF00778            DIX               Dix domain
PF00322            Endothelin        Endothelin family
PF00812            Ephrin            Ephrin
PF01404            EPH_lbd           Ephrin receptor ligand binding domain
PF00167            FGF               Fibroblast growth factor
PF01534            Frizzled          Frizzled/Smoothened family membrane region
PF00236            Hormone6          Glycoprotein hormones
PF01153            Glypican          Glypican
PF01271            Granin            Granin (chromogranin or secretogranin)
PF02058            Guanylin          Guanylin precursor
PF00049            Insulin           Insulin/IGF/Relaxin family
PF00219            IGFBP             Insulin-like growth factor binding proteins
PF02024            Leptin            Leptin
PF00193            Xlink             LINK (hyaluron binding)
PF00243            NGF               Nerve growth factor family
PF02158            Neuregulin        Neuregulin family
PF00184            Hormone5          Neurohypophysial hormones
PF02070            NMU               Neuromedin U
PF00066            Notch             Notch (DSL) domain
PF00865            Osteopontin       Osteopontin
PF00159            Hormone3          Pancreatic hormone peptides
PF01279            Parathyroid       Parathyroid hormone family
PF00123            Hormone2          Peptide hormone
PF00341            PDGF              Platelet-derived growth factor (PDGF)
PF01403            Sema              Sema domain
PF01033            Somatomedin_B     Somatomedin B domain
PF00103            Hormone           Somatotropin
PF02208            Sorb              Sorbin homologous domain
PF02404            SCF               Stem cell factor
PF01034            Syndecan          Syndecan domain
PF00020            TNFR_c6           TNFR/NGFR cysteine-rich region
PF00019            TGF-β             Transforming growth factor β-like domain
PF01099            Uteroglobin       Uteroglobin family
PF01160            Opiods_neuropep   Vertebrate endogenous opioids neuropeptide
PF00110            Wnt               Wnt family of developmental signaling proteins

Hemostasis
PF01821            ANATO             Anaphylotoxin-like domain
PF00386            C1q               C1q domain
PF00200            Disintegrin       Disintegrin
PF00754            F5_F8_type_C      F5/8 type C domain
PF01410            COLFI             Fibrillar collagen C-terminal domain
PF00039            Fn1               Fibronectin type I domain
PF00040            Fn2               Fibronectin type II domain
PF00051            Kringle           Kringle domain
PF01823            MACPF             MAC/Perforin domain
PF00354            Pentaxin          Pentaxin family
PF00277            SAA_proteins      Serum amyloid A protein
PF00084            Sushi             Sushi domain (SCR repeat)
PF02210            TSPN              Thrombospondin N-terminal-like domains
PF01108            Tissue_fac        Tissue factor
PF00868            Transglutamin_N   Transglutaminase family
PF00927            Transglutamin_C   Transglutaminase family
3
3
1
2

10(11)
5

7(8)
12
23
9
14
1
0
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Table 18 (Continued) 



Accession number | Domain name | Domain description | H | F | W | Y | A

PF00594 | Gla | Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain | 11 | 0 | 0 | 0 | 0

Immune response
PF00711 | Defensin_beta | Beta defensin | 1 | 0 | 0 | 0 | 0
PF00748 | Calpain_inhib | Calpain inhibitor repeat | 3(9) | 0 | 0 | 0 | 0
PF00666 | Cathelicidins | Cathelicidins | 2 | 0 | 0 | 0 | 0
PF00129 | MHC_I | Class I histocompatibility antigen, domains alpha 1 and 2 | [illegible](20) | 0 | 0 | [illegible] | 0
PF00993 | MHC_II_alpha** | Class II histocompatibility antigen, alpha domain | 5(6) | 0 | 0 | 0 | 0
PF00969 | MHC_II_beta** | Class II histocompatibility antigen, beta domain | 7 | 0 | 0 | 0 | 0
[illegible] | Defensin_propep | Defensin propeptide | 3 | 0 | 0 | 0 | 0
PF01109 | GM_CSF | Granulocyte-macrophage colony-stimulating factor | 1 | 0 | 0 | 0 | 0
PF00047 | ig | Immunoglobulin domain | 381(930) | 125(291) | 67(323) | 0 | 0
PF00143 | Interferon | Interferon alpha/beta domain | 7(9) | 0 | 0 | 0 | 0
PF00714 | IFN-gamma | Interferon gamma | 1 | 0 | 0 | 0 | 0
PF00726 | IL10 | Interleukin-10 | 1 | 0 | 0 | 0 | 0
PF02372 | IL15 | Interleukin-15 | 1 | 0 | 0 | 0 | 0
[illegible] | IL2 | Interleukin-2 | 1 | [illegible] | 0 | 0 | 0
PF00727 | IL4 | Interleukin-4 | 1 | 0 | 0 | 0 | 0
PF02025 | IL5 | Interleukin-5 | 1 | 0 | 0 | 0 | 0
PF01415 | IL7 | Interleukin-7/9 family | 1 | 0 | 0 | 0 | 0
PF00340 | IL1 | Interleukin-1 | 7 | 0 | 0 | 0 | 0
PF02394 | IL1_propep | Interleukin-1 propeptide | 1 | 0 | [illegible] | 0 | 0
PF02059 | IL3 | Interleukin-3 | 1 | 0 | 0 | 0 | 0
PF00489 | IL6 | Interleukin-6/G-CSF/MGF family | 2 | 0 | 0 | 0 | 0
PF01291 | LIF_OSM | Leukemia inhibitory factor (LIF)/oncostatin (OSM) family | 2 | 0 | 0 | 0 | 0
PF00323 | Defensins | Mammalian defensin | 2 | 0 | 0 | 0 | 0
PF01091 | PTN_MK | PTN/MK heparin-binding protein | 2 | 0 | 0 | 0 | 0
PF00277 | SAA_proteins | Serum amyloid A protein | [illegible] | 0 | 0 | 0 | 0
PF00048 | IL8 | Small cytokines (intecrine/chemokine), interleukin-8 like | 32 | 0 | 0 | 0 | 0
PF01582 | TIR | TIR domain | 18 | 8 | 2 | [illegible] | 131(143)
PF00229 | TNF | TNF (tumor necrosis factor) family | 12 | 0 | 0 | [illegible] | 0
PF00088 | trefoil | Trefoil (P-type) domain | 5(6) | 0 | 2 | 0 | 0

PI-PY-rho GTPase signaling
PF00779 | BTK | BTK motif | 5 | 1 | 0 | 0 | 0
PF00168 | C2 | C2 domain | 73(101) | 32(44) | 24(35) | 6(9) | 66(90)
PF00609 | DAGKa | Diacylglycerol kinase accessory domain (presumed) | 9 | 4 | 7 | 0 | 6
PF00781 | DAGKc | Diacylglycerol kinase catalytic domain (presumed) | 10 | [illegible] | [illegible] | 2 | 11(12)
PF00610 | DEP | Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP) | 12(13) | 4 | 10 | 5 | 2
PF01363 | FYVE | FYVE zinc finger | 28(30) | 14 | 15 | 5 | 15
PF00996 | GDI | GDP dissociation inhibitor | 6 | 2 | 1 | 1 | 3
PF00503 | G-alpha | G-protein alpha subunit | 27(30) | 10 | [illegible] | 2 | 5
PF00631 | G-gamma | G-protein gamma like domains | 16 | 5 | 5 | 1 | 0
PF00616 | RasGAP | GTPase-activator protein for Ras-like GTPase | 11 | [illegible] | [illegible] | 3 | 0
PF00618 | RasGEFN | Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif | 9 | 2 | 3 | 5 | 0
PF00625 | Guanylate_kin | Guanylate kinase | 12 | 8 | 7 | 1 | 4
PF02189 | ITAM | Immunoreceptor tyrosine-based activation motif | 3 | 0 | 0 | 0 | 0
PF00169 | PH | PH domain | 193(212) | 72(78) | 65(68) | 24 | 23
PF00130 | DAG_PE-bind | Phorbol esters/diacylglycerol binding domain (C1 domain) | 45(56) | 25(31) | 26(40) | 1(2) | 4
PF00388 | PI-PLC-X | Phosphatidylinositol-specific phospholipase C, X domain | 12 | 3 | 7 | 1 | 8
PF00387 | PI-PLC-Y | Phosphatidylinositol-specific phospholipase C, Y domain | 11 | 2 | 7 | 1 | 8
PF00640 | PID | Phosphotyrosine interaction domain (PTB/PID) | 24(27) | 13 | 11(12) | 0 | 0
PF02192 | PI3K_p85B | PI3-kinase family, p85-binding domain | 2 | 1 | 1 | 0 | 0
PF00794 | PI3K_rbd | PI3-kinase family, ras-binding domain | 6 | 3 | 1 | 0 | 0
PF01412 | ArfGAP | Putative GTP-ase activating protein for Arf | 16 | 9 | 8 | 6 | 15
PF02196 | RBD | Raf-like Ras-binding domain | 6(7) | 4 | 1 | 0 | 0
PF02145 | Rap_GAP | Rap/ran-GAP | 5 | 4 | 2 | 0 | 0
PF00788 | RA | Ras association (RalGDS/AF-6) domain | 18(19) | 7(9) | 6 | 1 | 0
PF00071 | Ras | Ras family | 126 | 56(57) | 51 | 23 | 78
PF00617 | RasGEF | RasGEF domain | 21 | 8 | 7 | 5 | 0
PF00615 | RGS | Regulator of G protein signaling domain | 27 | 6(7) | 12(13) | 1 | 0
PF02197 | RIIa | Regulatory subunit of type II PKA R-subunit | 4 | 1 | 2 | 1 | 0
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Table 18 (Continued) 






Accession number | Domain name | Domain description | H | F | W | Y | A

PF00620 | RhoGAP | RhoGAP domain | 59 | 19 | 20 | 9 | 8
PF00621 | RhoGEF | RhoGEF domain | 46 | 23(24) | 18(19) | 3 | 0
PF00536 | SAM | SAM domain (Sterile alpha motif) | 29(31) | 15 | 8 | 3 | 6
PF01369 | Sec7 | Sec7 domain | 13 | 5 | 5 | 5 | 9
PF00017 | SH2 | Src homology 2 (SH2) domain | 87(95) | 33(39) | 44(48) | [illegible] | 3
PF00018 | SH3 | Src homology 3 (SH3) domain | 143(182) | 55(75) | 46(61) | 23(27) | 4
PF01017 | STAT | STAT protein | 7 | 1 | 1(2) | 0 | 0
PF00790 | VHS | VHS domain | 4 | 2 | 4 | 4 | 8
PF00568 | WH1 | WH1 domain | 7 | 2 | 2(3) | 1 | 0

Domains involved in apoptosis
PF00452 | Bcl-2 | Bcl-2 | 9 | 2 | 1 | 0 | 0
PF02180 | BH4 | Bcl-2 homology region 4 | 3 | 0 | 1 | 0 | 0
PF00619 | CARD | Caspase recruitment domain | 16 | 0 | 2 | 0 | 0
PF00531 | Death | Death domain | 16 | 5 | 7 | 0 | 0
PF01335 | DED | Death effector domain | 4(5) | 0 | 0 | 0 | 0
PF02179 | BAG | Domain present in Hsp70 regulators | 5(8) | 3 | 2 | 1 | 5
PF00656 | ICE_p20 | ICE-like protease (caspase) p20 domain | 11 | 7 | 3 | 0 | 0
PF00653 | BIR | Inhibitor of Apoptosis domain | 8(14) | 5(9) | 2(3) | 1(2) | 0

Cytoskeletal
PF00022 | Actin | Actin | 61(64) | 15(16) | 12 | 9(11) | 24
PF00191 | Annexin | Annexin | 16(55) | 4(16) | 4(11) | 0 | 6(16)
[missing] | Calponin | Calponin family | 13(22) | 3 | 7(19) | 0 | 0
PF00373 | Band_41 | FERM domain (Band 4.1 family) | 29(30) | 17(19) | 11(14) | 0 | 0
PF00880 | Nebulin_repeat | Nebulin repeat | 4(148) | 1(2) | 1 | 0 | 0
PF00681 | Plectin_repeat | Plectin repeat | 2(11) | 0 | 0 | 0 | 0
PF00435 | Spectrin | Spectrin repeat | 31(195) | 13(171) | 10(93) | 0 | 0
PF00418 | Tubulin-binding | Tau and MAP proteins, tubulin-binding | 4(12) | 1(4) | 2(8) | 0 | 0
PF00992 | Troponin | Troponin | 4 | 6 | 8 | 0 | 0
PF02209 | VHP | Villin headpiece domain | 5 | 2 | 2 | 0 | 5
PF01044 | Vinculin | Vinculin family | 4 | 2 | 1 | 0 | 0

ECM adhesion
PF01391 | Collagen | Collagen triple helix repeat (20 copies) | 65(279) | 10(46) | 174(384) | 0 | 0
PF01413 | C4 | C-terminal tandem repeated domain in type 4 procollagen | 6(11) | 2(4) | 3(6) | 0 | 0
PF00431 | CUB | CUB domain | 47(69) | 9(47) | 43(67) | 0 | 0
PF00008 | EGF | EGF-like domain | 108(420) | 45(186) | 54(157) | 0 | 1
PF00147 | Fibrinogen_C | Fibrinogen beta and gamma chains, C-terminal globular domain | 26 | 10(11) | 6 | 0 | 0
PF00041 | Fn3 | Fibronectin type III domain | 106(545) | 42(168) | 34(156) | 0 | 1
PF00757 | Furin-like | Furin-like cysteine rich region | 5 | 2 | 1 | 0 | 0
PF00357 | Integrin_A | Integrin alpha cytoplasmic region | 3 | 1 | 2 | 0 | 0
PF00362 | Integrin_B | Integrins, beta chain | 8 | [illegible] | 2 | 0 | 0
PF00052 | Laminin_B | Laminin B (Domain IV) | 8(12) | 4(7) | 6(10) | 0 | 0
PF00053 | Laminin_EGF | Laminin EGF-like (Domains III and V) | 24(126) | 9(62) | 11(65) | 0 | 0
PF00054 | Laminin_G | Laminin G domain | 30(57) | 18(42) | 14(26) | 0 | 0
PF00055 | Laminin_Nterm | Laminin N-terminal (Domain VI) | 10 | 6 | 4 | 0 | 0
PF00059 | Lectin_c | Lectin C-type domain | 47(76) | 23(24) | 91(132) | 0 | 0
PF01463 | LRRCT | Leucine rich repeat C-terminal domain | 69(81) | 23(30) | 7(9) | 0 | 0
PF01462 | LRRNT | Leucine rich repeat N-terminal domain | 40(44) | 7(13) | 3(6) | 0 | 0
PF00057 | Ldl_recept_a | Low-density lipoprotein receptor domain class A | 35(127) | 33(152) | 27(113) | 0 | 0
PF00058 | Ldl_recept_b | Low-density lipoprotein receptor repeat class B | 15(96) | 9(56) | 7(22) | 0 | 0
PF00530 | SRCR | Scavenger receptor cysteine-rich domain | 11(46) | 4(8) | 1(2) | 0 | 0
PF00084 | Sushi | Sushi domain (SCR repeat) | 53(191) | 11(42) | 8(45) | 0 | 0
PF00090 | Tsp_1 | Thrombospondin type 1 domain | 41(66) | 11(23) | 18(47) | 0 | 0
PF00092 | Vwa | von Willebrand factor type A domain | 34(58) | 0 | 17(19) | 0 | 1
PF00093 | Vwc | von Willebrand factor type C domain | 19(28) | 6(11) | 2(5) | 0 | 0
PF00094 | Vwd | von Willebrand factor type D domain | 15(35) | 3(7) | 9 | 0 | 0

Protein interaction domains
PF00244 | 14-3-3 | 14-3-3 proteins | 20 | 3 | 3 | 2 | 15
PF00023 | Ank | Ank repeat | 145(404) | 72(269) | 75(223) | 12(20) | 66(111)
PF00514 | Armadillo_seg | Armadillo/beta-catenin-like repeats | 22(56) | 11(38) | 3(11) | 2(10) | 25(67)
PF00168 | C2 | C2 domain | 73(101) | 32(44) | 24(35) | 6(9) | 66(90)
PF00027 | cNMP_binding | Cyclic nucleotide-binding domain | 26(31) | 21(33) | 15(20) | 2(3) | 22
PF01556 | DnaJ_C | DnaJ C terminal region | 12 | 9 | 5 | 3 | 19
PF00226 | DnaJ | DnaJ domain | 44 | 34 | 33 | 20 | 93
PF00036 | Efhand** | EF hand | 83(151) | 64(117) | 41(86) | 4(11) | 120(328)
PF00611 | FCH | Fes/CIP4 homology domain | 9 | 3 | [illegible] | 4 | 0
PF01846 | FF | FF domain | 4(11) | 4(10) | 3(16) | 2(5) | 4(8)
PF00498 | FHA | FHA domain | 13 | 15 | 7 | 13(14) | 17
myelin proteins result in severe demyelination, which is a pathological condition in which the myelin is lost and the nerve conduction is severely impaired (130). Humans have at least 10 genes belonging to four different families involved in myelin production (five myelin P0, three myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte glycoprotein, or MOG), and possibly more-remotely related members of the MOG family. Flies have only a single myelin proteolipid, and worms have none at all.



Intercellular and intracellular signaling pathways in development and homeostasis. Many protein families that have expanded in humans relative to the invertebrates are involved in signaling processes, particularly in response to development and differentiation



Table 18 (Continued) 



Accession number | Domain name | Domain description | H | F | W | Y | A

PF00254 | FKBP | FKBP-type peptidyl-prolyl cis-trans isomerases | 15(20) | 7(8) | 7(13) | 4 | 24(29)
PF01590 | GAF | GAF domain | 7(8) | 2(4) | 1 | 0 | 10
PF01344 | Kelch | Kelch motif | 54(157) | 12(48) | 13(41) | 3 | 102(178)
PF00560 | LRR** | Leucine Rich Repeat | 25(30) | 24(30) | 7(11) | 1 | 15(16)
PF00917 | MATH | MATH domain | 11 | 5 | 88(161) | 1 | 61(74)
PF00989 | PAS | PAS domain | 18(19) | 9(10) | 6 | 1 | 13(18)
PF00595 | PDZ | PDZ domain (Also known as DHR or GLGF) | 96(154) | 60(87) | 46(66) | 2 | 5
PF00169 | PH | PH domain | 193(212) | 72(78) | 65(68) | 24 | 23
PF01535 | PPR** | PPR repeat | 5 | 3(4) | 0 | 1 | 474(2485)
PF00536 | SAM | SAM domain (Sterile alpha motif) | 29(31) | 15 | 8 | 3 | 6
PF01369 | Sec7 | Sec7 domain | 13 | 5 | 5 | 5 | 9
PF00017 | SH2 | Src homology 2 (SH2) domain | 87(95) | 33(39) | 44(48) | [illegible] | 3
PF00018 | SH3 | Src homology 3 (SH3) domain | 143(182) | 55(75) | 46(61) | 23(27) | 4
PF01740 | STAS | STAS domain | 5 | 1 | 6 | 2 | 13
PF00515 | TPR** | TPR domain | 72(131) | 39(101) | 28(54) | 16(31) | 65(124)
PF00400 | WD40** | WD40 domain | 136(305) | 98(226) | 72(153) | 56(121) | 167(344)
PF00397 | WW | WW domain | 32(53) | 24(39) | 16(24) | 5(8) | 11(15)
PF00569 | ZZ | ZZ-Zinc finger present in dystrophin, CBP/p300 | 10(11) | 13 | 10 | 2 | 10

Nuclear interaction domains
PF01754 | Zf-A20 | A20-like zinc finger | 2(8) | 2 | 2 | 0 | 8
PF01388 | ARID | ARID DNA binding domain | 11 | 6 | 4 | 2 | 7
PF01426 | BAH | BAH domain | 8(10) | 7(8) | 4(5) | 5 | 21(25)
PF00643 | Zf-B_box** | B-box zinc finger | 32(35) | 1 | 2 | 0 | 0
PF00533 | BRCT | BRCA1 C Terminus (BRCT) domain | 17(28) | 10(18) | 23(35) | 10(16) | 12(16)
PF00439 | Bromodomain | Bromodomain | 37(48) | 16(22) | 18(26) | 10(15) | 28
PF00651 | BTB | BTB/POZ domain | 97(98) | 62(64) | 86(91) | 1(2) | 30(31)
PF00145 | DNA_methylase | C-5 cytosine-specific DNA methylase | 3(4) | 1 | 0 | 0 | 13(15)
PF00385 | Chromo | 'chromo' (CHRromatin Organization MOdifier) domain | 24(27) | 14(15) | 17(18) | 1(2) | 12
PF00125 | Histone | Core histone H2A/H2B/H3/H4 | 75(81) | 5 | 71(73) | 8 | 48
PF00134 | Cyclin | Cyclin | 19 | 10 | 10 | 11 | 35
PF00270 | DEAD | DEAD/DEAH box helicase | 63(66) | 48(50) | 55(57) | 50(52) | 84(87)
PF01529 | Zf-DHHC | DHHC zinc finger domain | 15 | 20 | 16 | 7 | 22
PF00646 | F-box** | F-box domain | 16 | 15 | 309(324) | 9 | 165(167)
PF00250 | Fork_head | Fork head domain | 35(36) | 20(21) | 15 | 4 | 0
PF00320 | GATA | GATA zinc finger | 11(17) | 5(6) | 8(10) | 9 | 26
PF01585 | G-patch | G-patch domain | 18 | 16 | 13 | 4 | 14(15)
PF00010 | HLH** | Helix-loop-helix DNA-binding domain | 60(61) | 44 | 24 | 4 | 39
PF00850 | Hist_deacetyl | Histone deacetylase family | 12 | 5(6) | 8(10) | 5 | 10
PF00046 | Homeobox | Homeobox domain | 160(178) | 100(103) | 82(84) | 6 | 66
PF01833 | TIG | IPT/TIG domain | 29(53) | 11(13) | 5(7) | 2 | 1
PF02373 | JmjC | JmjC domain | 10 | 4 | 6 | 4 | 7
PF02375 | JmjN | JmjN domain | 7 | 4 | 2 | 3 | 7
PF00013 | KH-domain | KH domain | 28(67) | 14(32) | 17(46) | 4(14) | 27(61)
PF01352 | KRAB | KRAB box | 204(243) | 0 | 0 | 0 | 0
PF00104 | Hormone_rec | Ligand-binding domain of nuclear hormone receptor | 47 | 17 | 142(147) | 0 | 0
PF00412 | LIM | LIM domain containing proteins | 62(129) | 33(83) | 33(79) | 4(7) | 10(16)
PF00917 | MATH | MATH domain | 11 | 5 | 88(161) | 1 | 61(74)
PF00249 | Myb_DNA-binding | Myb-like DNA-binding domain | 32(43) | 18(24) | 17(24) | 15(20) | 243(401)
PF02344 | Myc-LZ | Myc leucine zipper domain | 1 | 0 | 0 | 0 | 0
PF01753 | Zf-MYND | MYND finger | 14 | 14 | 9 | 1 | 7
PF00628 | PHD | PHD-finger | 68(86) | 40(53) | 32(44) | 14(15) | 96(105)
PF00157 | Pou | Pou domain, N-terminal to homeobox domain | 15 | 5 | 4 | 0 | 0
PF02257 | RFX_DNA_binding | RFX DNA-binding domain | 7 | 2 | 1 | 1 | 0
PF00076 | Rrm | RNA recognition motif (a.k.a. RRM, RBD, or RNP domain) | 224(324) | 127(199) | 94(145) | 43(73) | 232(369)
PF02037 | SAP | SAP domain | 15 | 8 | 5 | 5 | 6(7)
PF00622 | SPRY | SPRY domain | 44(51) | 10(12) | 5(7) | 3 | 6
PF01852 | START | START domain | 10 | 2 | 6 | 0 | 23
PF00907 | T-box | T-box | 17(19) | 8 | 22 | 0 | 0
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Table 18 (Continued) 



Accession number | Domain name | Domain description | H | F | W | Y | A

PF02135 | Zf-TAZ | TAZ finger | 2(3) | 1(2) | 6(7) | 0 | 10(15)
PF01285 | TEA | TEA domain | 4 | 1 | 1 | 1 | 0
PF02176 | Zf-TRAF | TRAF-type zinc finger | 6(9) | 1(3) | 2(4) | 0 | 2
PF00352 | TBP | Transcription factor TFIID (or TATA-binding protein, TBP) | 2(4) | 4(8) | [illegible] | 1(2) | 2(4)
PF00567 | TUDOR | TUDOR domain | 9(24) | 9(19) | 4(5) | 0 | 2
PF00642 | Zf-CCCH | Zinc finger C-x8-C-x5-C-x3-H type (and similar) | 17(22) | 6(8) | 22(42) | 3(5) | 31(46)
PF00096 | Zf-C2H2** | Zinc finger, C2H2 type | 564(4500) | 234(771) | 68(155) | 34(56) | 21(24)
PF00097 | Zf-C3HC4 | Zinc finger, C3HC4 type (RING finger) | 135(137) | 57 | 88(89) | 18 | 298(304)
PF00098 | Zf-CCHC | Zinc knuckle | 9(17) | 6(10) | 17(33) | 7(13) | 68(91)



(Tables 18 and 19). They include secreted hormones and growth factors, receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome include growth factors such as wnt, transforming growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet-derived growth factor (PDGF), and ephrins. These growth factors affect tissue differentiation and a wide range of cellular processes involving actin-cytoskeletal and nuclear regulation. The corresponding receptors of these developmental ligands are also expanded in humans. For example, our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of transcriptional corepressors downstream in the wnt pathway is even more markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in the worm).
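The family counts quoted in this paragraph can be turned into fold expansions with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the dictionary simply restates the numbers given in the text (the human ephrin count is a lower bound, "at least 8"), and the names `FAMILY_COUNTS`, `fold_expansion`, and `expansion_table` are illustrative.

```python
# Gene-family counts quoted in the text, as (human, fly, worm).
# The human ephrin value is a lower bound ("at least 8").
FAMILY_COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return human/other as a float, or None if the other genome has no member."""
    return human / other if other else None


def expansion_table(counts=FAMILY_COUNTS):
    """Map each family name to (fold vs. fly, fold vs. worm)."""
    return {
        name: (fold_expansion(h, f), fold_expansion(h, w))
        for name, (h, f, w) in counts.items()
    }


if __name__ == "__main__":
    for name, (vs_fly, vs_worm) in expansion_table().items():
        print(f"{name:16s} {vs_fly:5.1f}x fly  {vs_worm:5.1f}x worm")
```

On these numbers the Groucho family shows the largest ratio against the worm (13 to 1), which is the sense in which the text calls it "even more markedly expanded".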

Extracellular adhesion molecules involved in signaling are expanded in the human genome (Tables 18 and 19). The interactions of several of these adhesion domains with extracellular matrix proteoglycans play a critical role in host defense, morphogenesis, and tissue repair (131). Consistent with the well-defined role of heparan sulfate proteoglycans in modulating these interactions (132), we observe an expansion of the heparin sulfate sulfotransferases in the human genome relative to worm and fly. These sulfotransferases modulate tissue differentiation (133). A similar expansion in humans is noted in structural proteins that constitute the actin-cytoskeletal architecture. Compared with the fly and worm, we observe an explosive expansion of the nebulin (35 domains per protein on average), aggrecan (12 domains per protein on average), and plectin (5 domains per protein on average) repeats in humans. These repeats are present in proteins involved in modulating the actin-cytoskeleton with predominant expression in neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several expanded protein families and domains involved in cytoplasmic signal transduction (Table 18). In particular, signal transduction pathways playing roles in developmental regulation and acquired immunity were substantially enriched. There is a factor of 2 or greater expansion in humans in the Ras superfamily GTPases and the GTPase activator and GTP exchange factors associated with them. Although there are about the same number of tyrosine kinases in the human and C. elegans genomes, in humans there is an increase in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal transduction. Further, there is a twofold expansion of phosphodiesterases in the human genome compared with either the worm or fly genomes.
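The "factor of 2 or greater" statement can be checked against individual Table 18 rows. The sketch below is illustrative only: the protein counts are the human, fly, and worm values as they can be read from the Ras, RasGEF, SH2, and PID rows of this scanned copy, and `PROTEIN_COUNTS` and `at_least_twofold` are hypothetical names introduced here.

```python
# Protein counts (human, fly, worm) as read from Table 18 in this copy.
PROTEIN_COUNTS = {
    "Ras": (126, 56, 51),
    "RasGEF": (21, 8, 7),
    "SH2": (87, 33, 44),
    "PID": (24, 13, 11),
}


def at_least_twofold(human, fly, worm, factor=2.0):
    """True when the human count is >= factor times both the fly and worm counts."""
    return human >= factor * fly and human >= factor * worm


if __name__ == "__main__":
    for name, counts in PROTEIN_COUNTS.items():
        print(name, at_least_twofold(*counts))
```

With these values the Ras and RasGEF rows meet the twofold threshold against both invertebrates, while the SH2 and PID (PTB) rows show an increase that falls short of twofold against at least one of them, consistent with the text describing only "an increase" for those domains.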

The downstream effectors of the intracellular signaling molecules include the transcription factors that transduce developmental fates. Significant expansions are noted in the ligand-binding nuclear hormone receptor class of transcription factors compared with the fly genome, although not to the extent observed in the worm (Tables 18 and 19). Perhaps the most striking expansion in humans is in the C2H2 zinc finger transcription factors. Pfam detects a total of 4500 C2H2 zinc finger domains in 564 human proteins, compared with 771 in 234 fly proteins. This means that there has been a dramatic expansion not only in the number of C2H2 transcription factors, but also in the number of these DNA-binding motifs per transcription factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on average in the worm). Furthermore, many of these transcription factors contain either the KRAB or SCAN domains, which are not found in the fly or worm genomes. These domains are involved in the oligomerization of transcription factors and increase the combinatorial partnering of these factors. In general, most of the transcription factor domains are shared between the three animal genomes, but the reassortment of these domains results in organism-specific transcription factor families. The domain combinations found in the human, fly, and worm include the BTB with C2H2 in the fly and humans, and homeodomains alone or in combination with Pou and LIM domains in all of the animal genomes. In plants, however, a different set of transcription factors are expanded, namely, the myb family, and a unique set that includes VP1 and AP2 domain-containing proteins (134). The yeast genome has a paucity of transcription factors compared with the multicellular eukaryotes, and its repertoire is limited to the expansion of the yeast-specific C6 transcription factor family involved in metabolic regulation.
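The per-protein averages quoted for the C2H2 zinc fingers follow directly from the Table 18 cells. The sketch below assumes the cell notation is "proteins(domains)", which is what the text's "4500 domains in 564 human proteins" implies for the cell 564(4500), and further assumes that a bare number means the two counts are equal. The worm cell 68(155) is taken from the Zf-C2H2 row as read in this copy; `parse_cell` and `domains_per_protein` are illustrative names.

```python
import re

# Assumed cell format: "proteins(domains)"; a bare number is taken to mean
# that the protein count and the domain count are the same.
CELL = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_cell(cell):
    """Parse a table cell such as '564(4500)' into (proteins, domains)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unreadable cell: {cell!r}")
    proteins = int(match.group(1))
    domains = int(match.group(2)) if match.group(2) else proteins
    return proteins, domains


def domains_per_protein(cell):
    """Average number of domains per protein for one table cell."""
    proteins, domains = parse_cell(cell)
    return domains / proteins if proteins else 0.0


if __name__ == "__main__":
    for organism, cell in [("human", "564(4500)"), ("fly", "234(771)"), ("worm", "68(155)")]:
        print(f"{organism}: {domains_per_protein(cell):.1f} C2H2 domains per protein")
```

Rounded to one decimal place, the three cells give 8.0, 3.3, and 2.3, matching the averages stated in the paragraph.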

While we have illustrated expansions in a subset of signal transduction molecules in the human genome compared with the other eukaryotic genomes, it should be noted that most of the protein domains are highly conserved. An interesting observation is that worms and humans have approximately the same number of both tyrosine kinases and serine/threonine kinases (Table 19). It is important to note, however, that these are merely counts of the catalytic domain; the proteins that contain these domains also display a wide repertoire of interaction domains with significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the coagulation pathway and by the interactions that occur between the vascular endothelium and platelets. Consistent with known anatomical and physiological differences between vertebrates and invertebrates, extracellular adhesion domains that constitute proteins integral to hemostasis are expanded in the human relative to the fly and worm (Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions between hematopoietic cells and the vascular matrix. In addition, there has been extensive recruitment of more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved in hemostatic regulation. Although we do not find a large expansion in the total number of serine proteases, this enzymatic domain has been specifically recruited into several of these multidomain proteins for proteolytic regulation in the vascular compartment. These are represented in plasma proteins that belong to the kinin and complement pathways. There is a significant expansion in two families of matrix metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs (matrix metalloproteases) (Table 19). Proteolysis of extracellular matrix (ECM) proteins is critical for tissue development and for tissue degradation in diseases such as cancer, arthritis, Alzheimer's disease, and a variety of inflammatory conditions (135, 136). ADAMs are a family of integral membrane proteins with a pivotal role in fibrinogenolysis and modulating interactions between hematopoietic components and the vascular matrix components. These proteins have been shown to cleave matrix proteins, and even signaling molecules: ADAM-17 converts tumor necrosis factor-α, and ADAM-10 has been implicated in the Notch signaling pathway (135). We have identified 19 members of the matrix metalloprotease family, and a total of 51 members of the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway components across eukarya is consistent with its central role in developmental regulation and as a response to pathogens and stress signals. The signal transduction pathways involved in programmed cell death, or apoptosis, are mediated by interactions between well-characterized domains that include extracellular domains, adaptor (protein-protein interaction) domains, and those found in effector and regulatory enzymes (137). We enumerated the protein counts of central adaptor and effector enzyme domains that are found only in the apoptotic pathways to provide an estimate of divergence across eukarya and relative expansion in the human genome when compared with the fly and worm (Table 18). Adaptor domains found in proteins restricted only to apoptotic regulation such as the DED domains are vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented in the fly and worm (although the number of Bcl2 family members in humans is significantly expanded). Although plants and yeast lack the caspases, caspase-like molecules, namely the para- and meta-caspases, have been reported in these organisms (138). Compared with other animal genomes, the human genome shows an expansion in the adaptor and effector domain-containing proteins involved in apoptosis, as well as in the proteases involved in the cascade such as the caspase and calpain families.

Expansions of other protein families.
Metabolic enzymes. There are fewer cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases (six in humans), on the other hand, appear to be specific to the vertebrates and plants, whereas the lipoxygenase-activating proteins (four in humans) may be vertebrate-specific. Lipoxygenases are involved in arachidonic acid metabolism, and they and their activators have been implicated in diverse human pathology ranging from allergic responses to cancers. One of the most surprising human expansions, however, is in the number of glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3 in the fly, and 4 in the worm). There is, however, evidence for many retrotransposed GAPDH pseudogenes (139), which may account for this apparent expansion. However, it is interesting that GAPDH, long known as a conserved enzyme involved in basic metabolism found across all phyla from bacteria to humans, has recently been shown to have other functions. It has a second catalytic activity, as a uracil DNA glycosylase (140), functions as a cell cycle regulator (141), and has even been implicated in apoptosis (142).

Translation. Another striking set of human expansions has occurred in certain families involved in the translational machinery. We identified 28 different ribosomal subunits that each have at least 10 copies in the genome; on average, for all ribosomal proteins there is about an 8- to 10-fold expansion in the number of genes relative to either the worm or fly. Retrotransposed pseudogenes may account for many of these expansions [see the discussion above and (143)]. Recent evidence suggests that a number of ribosomal proteins have secondary functions independent of their involvement in protein biosynthesis; for example, L13a and the related L7 subunits (36 copies in humans) have been shown to induce apoptosis (144).

There is also a four- to fivefold expansion in the elongation factor 1-alpha family (eEF1A; 56 human genes). Many of these expansions likely represent intronless paralogs that have presumably arisen from retrotransposition, and again there is evidence that many of these may be pseudogenes (145). However, a second form (eEF1A2) of this factor has been identified with tissue-specific expression in skeletal muscle and a complementary expression pattern to the ubiquitously expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing results in multiple transcripts from a single gene, and can therefore generate additional diversity in an organism's protein complement. We have identified 269 genes for ribonucleoproteins. This represents over 2.5 times the number of ribonucleoprotein genes in the worm, two times that of the fly, and about the same as the 265 identified in the Arabidopsis genome. Whether the diversity of ribonucleoprotein genes in humans contributes to gene regulation at either the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most prominent expansion is the transglutaminases, calcium-dependent enzymes that catalyze the cross-linking of proteins in cellular processes such as hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase gene product acts on the GLA domain (missing in the fly and worm) found in coagulation factors, osteocalcin, and matrix GLA protein (148). Tyrosylprotein sulfotransferases participate in the posttranslational modification of proteins involved in inflammation and hemostasis, including coagulation factors and chemokine receptors (149). Although there is no significant numerical increase in the counts for domains involved in nuclear protein modification, there are a number of domain arrangements in the predicted human proteins that are not found in the other currently sequenced genomes. These include the tandem association of two histone deacetylase domains in HD6 with a ubiquitin finger domain, a feature lacking in the fly genome. An additional example is the co-occurrence of the important nuclear regulatory enzyme PARP (poly-ADP ribosyl transferase) domain fused to the protein-interaction domains BRCT and VWA in humans.

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the. prominent differences in 
the immune system, hemostasis, neuronal, 
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive do- 
main shuffling to increase or alter combina- 
torial diversity can provide an exponential 
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increase in the ability to mediate protein- 
protein interactions without dramatically in- 
creasing the absolute size of the protein com- 
plement (J 50). Evolution of apparently new 
(from the perspective of sequence analysis) 
protein domains and increasing regulatory 
complexity by domain accretion both quanti- 
tatively and qualitatively (recruitment of nov- 
el domains with preexisting ones) are two 
features that, we observe in humans. Perhaps " 
the best illustration of this trend is the C2H2 
zinc finger-containing transcription factors' 
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal nbosomal entry sites in the human 
genome to regulate translation of specific 
classes of proteins suggests that this is an area 
that needs further research to identify the full 
extent of this process in the human genome 
(151). At the posttranslational level, although
we provide examples of expansions of some
protein families involved in these
modifications, further experimental evidence is
required to evaluate whether this is correlated
with increased complexity in protein
processing. Posttranscriptional processing and the
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 

8 Conclusions




Table 19 (Continued)

Panther family/subfamily*    H    W

Transcription factors/chromatin organization
  C2H2 zinc finger-containing†
  COE
  CREB
  ETS-related
  Forkhead-related
  FOS
  Groucho
  Histone H1
  Histone H2A
  Histone H2B
  Histone H3
  Histone H4
  Homeotic†
    ABD-B
    Bithoraxoid
    Iroquois class
    Distal-less
    Engrailed
    LIM-containing
    MEIS/KNOX class
    NK-3/NK-2 class
    Paired box
    Six
  Leucine zipper
  Nuclear hormone receptor†
  Pou-related
  Runt-related



8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of 
genome sizes and repeat content allows us to 
assess its strengths and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the
human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (75, 80, 152) demonstrate that 
megabase-sized genomes can be sequenced
efficiently without any input other than the de
novo mate-paired sequences. With more
complex genomes like those of Drosophila or
human, map information, in the form of
well-ordered markers, has been critical for
long-range ordering of scaffolds. For joining
scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequencing,
the prior existence of mapping data was
beneficial. During the sequencing of the
A. thaliana genome, sequencing of individual
BAC clones permitted extension of the
sequence well into centromeric regions and
allowed high-quality resolution of complex
repeat regions. Likewise, in Drosophila, the
BAC physical map was most useful in re- 
gions near the highly repetitive centromeres 
and telomeres. WGA has been found to de- 
liver excellent-quality reconstructions of the 
unique regions of the genome. As the genome 
size, and more importantly the repetitive con- 
tent, increases, the WGA approach delivers 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by- 
clone approaches makes them difficult to justify 
as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific
applications of BAC-based or other clone mapping and
sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently
resolved with computational approaches alone
are clearly worth exploring. Hybrid approaches
to whole-genome sequencing will only work if
there is sufficient coverage in both the
whole-genome shotgun phase and the BAC clone
sequencing phase. Our experience with human
genome assembly suggests that this will require
at least 3× coverage of both whole-genome and
BAC shotgun sequence data.
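
As a rough illustration of why coverage depth matters, the classical Lander-Waterman idealization can be sketched as follows. This is an assumption introduced here for illustration, not the assembly model used in this work: reads are treated as landing uniformly at random, so the expected fraction of bases sampled by no read at average coverage c is exp(-c). The function names and the example inputs are hypothetical.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average sequence depth c = N * L / G (idealized, no repeats or bias)."""
    return num_reads * read_length / genome_size


def uncovered_fraction(c):
    """Expected fraction of bases hit by no read under Poisson sampling."""
    return math.exp(-c)


if __name__ == "__main__":
    # Compare a single 3x data set with two 3x data sets pooled (6x in total).
    for c in (3, 6):
        print(f"{c}x coverage: ~{uncovered_fraction(c):.2%} of bases unsampled")
```

Under this idealization, 3× coverage leaves about 5% of bases unsampled, which is one way to see why neither phase of a hybrid strategy can be sequenced too lightly. Real genomes depart from the model because of repeats and cloning bias.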

8.2 The low gene number in humans 

We have sequenced and assembled ~95% of 
the euchromatic sequence of H. sapiens and
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the 
human genes. This has provided a major sur- 
prise: We have found far fewer genes (26,000 
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever 
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes 
that remain unpredicted owing to the absence
of EST or protein data to support them,
although our use of mouse genome data for
predicting genes should limit this number. As
was true at the beginning of genome sequenc- 
ing, ultimately it will be necessary to measure 
mRNA in specific cell types to demonstrate 
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a 
population of organisms might have to pay a 
price for the number of genes it can possibly 
carry. He theorized that when the number of
genes becomes too large, each zygote carries 
so many new deleterious mutations that the 
population simply cannot maintain itself. On 
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced 
mutations at specific loci, Muller, in 1967 
(154), calculated that the mammalian ge- 
nome would contain a maximum of not much 
more than 30,000 genes (155). An estimate of 
30,000 gene loci for humans was also arrived 
at by Crow and Kimura (156). Muller's
estimate for D. melanogaster was 10,000 genes,
compared to 13,000 derived by annotation of 
the fly genome (26, 27). These arguments for 
the theoretical maximum gene number were 
based on simplified ideas of genetic load — 
that all genes have a certain low rate of 
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 
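
The logic of a load-based ceiling on gene number can be sketched with a simple calculation. This is a minimal sketch under stated assumptions, not a reconstruction of Muller's actual computation: it uses the standard multiplicative mutation-load approximation, in which mean population fitness is about exp(-U) and U = 2 × (number of genes) × (per-gene deleterious mutation rate) for a diploid. The per-gene rate of 1e-5 and the function names are hypothetical, chosen only for illustration.

```python
import math


def mean_fitness(n_genes, u_per_gene):
    """Mean fitness exp(-U), with diploid genomic deleterious rate U = 2*n*u."""
    genomic_rate = 2 * n_genes * u_per_gene
    return math.exp(-genomic_rate)


def max_gene_number(u_per_gene, min_fitness):
    """Largest gene count whose load keeps mean fitness at or above min_fitness."""
    return -math.log(min_fitness) / (2 * u_per_gene)


if __name__ == "__main__":
    u = 1e-5  # hypothetical deleterious mutation rate per gene per generation
    for n in (10_000, 30_000, 100_000):
        print(f"{n:>7} genes -> mean fitness {mean_fitness(n, u):.3f}")
```

Because the load grows with gene number, any assumed floor on tolerable mean fitness translates directly into a maximum gene count. The knockout observations cited above are one reason the assumption that every gene mutates to a deleterious state is too simple.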

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities 
inherent in human development and the so- 
phisticated signaling systems that maintain 
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as by the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon
of RNA editing in which coding changes
occur directly at the level of mRNA is of
clinical and biological relevance (161).
Finally, examples of translational control include
internal ribosomal entry sites that are found 
in proteins involved in cell cycle regulation 
and apoptosis (162). At the protein level, 
minor alterations in the nature of protein-
protein interactions, protein modifications, 
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human
genome is asymmetrically populated with 
G+C content, CpG islands, and genes (68). 
However, the genes are not distributed quite 
as unequally as had been predicted (Table 9) 
(69). The most G+C-rich fraction of the ge- 
nome, H3 isochores, constitute more of the 
genome than previously thought (about 9%), 
and are the most gene-dense fraction, but 
contain only 25% of the genes, rather than the 
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene
duplication, has been described as the
"desertification" of the vertebrate genome (71). Why
are there clustered regions of high and low 
gene density, and are these accidents of his- 
tory or driven by selection and evolution? If 
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of hu- 
mans; for example, Miniopterus, a species of 
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans.
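
The isochore percentages quoted above can be turned into relative gene densities by dividing each fraction's share of genes by its share of the genome. The sketch below uses only the figures given in the text; the "other" row is derived here by subtraction and is an assumption, as are the variable and function names.

```python
# Share of the genome and share of the genes, in percent, from the text above.
ISOCHORES = {
    "H3": {"genome_pct": 9.0, "genes_pct": 25.0},
    "L": {"genome_pct": 65.0, "genes_pct": 48.0},
    # Remainder obtained by subtraction (100 - 9 - 65, 100 - 25 - 48); assumption.
    "other": {"genome_pct": 26.0, "genes_pct": 27.0},
}


def relative_gene_density(genome_pct, genes_pct):
    """Gene density relative to the genome-wide average (average = 1.0)."""
    return genes_pct / genome_pct


density = {
    name: relative_gene_density(row["genome_pct"], row["genes_pct"])
    for name, row in ISOCHORES.items()
}

if __name__ == "__main__":
    for name, value in density.items():
        print(f"{name:>5}: {value:.2f}x the genome-wide average gene density")
```

On these numbers the H3 fraction is a little under three times as gene-dense as the average, while the L isochores fall below the average, which matches the statement that genes are unevenly distributed but less so than predicted.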

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identi- 
fied and mapped more than 3 million SNPs, this 
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of
haplotypes present in subjects of different
ethnogeographic origins, providing insights into
population history and migration patterns. Although
such studies have suggested that modern human
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and
admixture, SNPs can serve as markers for the extent
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated. 

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism— sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo- 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
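
The chromosome-specific drift argument above can be made concrete with the standard neutral expectation that nucleotide diversity is roughly 4 × N_e × μ on the autosomes. The sketch below is illustrative only: it assumes an equal sex ratio, so that the X and Y have three-quarters and one-quarter of the autosomal effective size, and the values of N_e and μ are hypothetical placeholders rather than estimates from this work.

```python
# Chromosome copies per breeding pair relative to an autosome (equal sex ratio).
RELATIVE_EFFECTIVE_SIZE = {"autosome": 1.0, "X": 0.75, "Y": 0.25}


def expected_diversity(ne_autosomal, mu, chromosome="autosome"):
    """Neutral expectation pi = 4 * N_e * mu, scaled by chromosome type."""
    scale = RELATIVE_EFFECTIVE_SIZE[chromosome]
    return 4.0 * ne_autosomal * scale * mu


if __name__ == "__main__":
    ne = 10_000   # hypothetical autosomal effective population size
    mu = 2.5e-8   # hypothetical mutation rate per site per generation
    for chrom in ("autosome", "X", "Y"):
        pi = expected_diversity(ne, mu, chrom)
        print(f"{chrom:>8}: expected diversity ~{pi:.2e} per site")
```

Selection against linked deleterious mutations lowers the local effective size further, so observed diversity in such regions would be expected to fall below these neutral values.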

8.4 Genome complexity

We will soon be in a position to move away
from the cataloging of individual
components of the system, and beyond the
simplistic notions of "this binds to that, which
then docks on this, and then the complex
moves there. . ." (167) to the exciting area
of network perturbations, nonlinear re- 
sponses and thresholds, and their pivotal 
role in human diseases. 

The enumeration of other "parts lists"
reveals that in organisms with complex nervous
systems, neither gene number, neuron number,
nor number of cell types correlates in any
meaningful manner with even simplistic mea- 
sures of structural or behavioral complexity. 
Nor would they be expected to; this is the realm 
of nonlinearities and epigenesis (168). The 520 
million neurons of the common octopus exceed 
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and
human, and from comparative mammalian neu- 
roanatomy (169), that the morphological and 
behavioral diversity found in mammals is un- 
derpinned by a similar gene repertoire and sim- 
ilar neuroanatomies. For example, when one 
compares a pygmy marmoset (which is only 4 
inches tall and weighs about 6 ounces) to a 
chimpanzee, the brain volume of this minute 
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp 
and three orders less than that of humans. Yet 
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic
organizations, and cell types and neuroanatomies
are almost indistinguishable, yet the develop- 
mental modifications that predisposed human 
lineages to cortical expansion and development 
of the larynx, giving rise to language, culminat- 
ed in a massive singularity that by even the 
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of
"regulatory genes" that are significantly increased in
the human genome compared with the fly and
worm. These include extracellular ligands
and their cognate receptors (e.g., wnt,
frizzled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
conclusion that Einstein's brain was more
complex than that of Drosophila, closer
comparisons such as whether the set of predicted
human proteins is more complex than the
protein set of Drosophila, and if so, to what
degree, are not straightforward, since protein,
protein domain, or protein-protein interaction
measures do not capture context-dependent
interactions that underpin the dynamics
underlying phenotype.

Currently, there are more than 30 different
mathematical descriptions of complexity (170). 
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal popula- 
tions), is through graph theory (171). The ele- 
ments of the system can be represented by the 
vertices of complex topographies, with the edg- 
es representing the interactions between them. 
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not
due to redundancy, but is a property of inho- 
mogeneously wired networks. The error toler- 
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious
phenotypic effects (172), and yet the usually
conspicuous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because
deconvolving and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
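
The graph-theoretic point above, that an inhomogeneously wired network tolerates the loss of most nodes but is fragile to the loss of a few highly connected ones, can be illustrated with a toy example. This is a minimal sketch, not a model of any real protein network: the hub-and-leaf topology, the node names, and all function names are hypothetical.

```python
from collections import deque


def build_hub_network(n_hubs=3, leaves_per_hub=20):
    """Toy inhomogeneous graph: a few interlinked hubs, each with many leaves."""
    adj = {}

    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    hubs = [f"hub{i}" for i in range(n_hubs)]
    for i, hub in enumerate(hubs):
        for other in hubs[i + 1:]:
            add_edge(hub, other)
        for j in range(leaves_per_hub):
            add_edge(hub, f"{hub}_leaf{j}")
    return adj


def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over the surviving nodes
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def by_degree(adj, k, highest=True):
    """The k nodes of highest (or lowest) degree, ties broken by name."""
    ranked = sorted(adj, key=lambda n: (len(adj[n]), n), reverse=highest)
    return set(ranked[:k])


if __name__ == "__main__":
    net = build_hub_network()
    print("intact:", largest_component(net))
    print("3 peripheral nodes removed:",
          largest_component(net, by_degree(net, 3, highest=False)))
    print("3 hubs removed:",
          largest_component(net, by_degree(net, 3, highest=True)))
```

Knocking out three peripheral nodes barely changes the connected core, whereas knocking out the three hubs fragments the network entirely, which mirrors the contrast drawn above between knockouts with minor effects and those that are catastrophic.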

It has been predicted for the last 15 years 
that complete sequencing of the human
genome would open up new strategies for
human biological research and would have a
major impact on medicine, and through med- 
icine and public health, on society. Effects on 
biomedical research are already being felt. 
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It
has been possible only because of
innovations in instrumentation and software that
have allowed automation of almost every step
of the process from DNA preparation to
annotation. The next steps are clear: We must
define the complexity that ensues when this
relatively modest set of about 30,000 genes is
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- 
istry, physiology, and ultimately phenotype 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and
reductionism, the view that with complete
knowledge of the human genome sequence, it is
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 



THE HUMAN GENOME

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the 
publicly funded consortium of laboratories led by Francis Collins appears 
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two
ventures, one public and one private. That characterization detracts from 
the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago reflected, and now rewards, 
the confidence of those who believe that the pursuit of large-scale
fundamental problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was believed possible. 
Thus, we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and 
private entrepreneurship. 

There are excellent scientific reasons for applauding an outcome that 
has given us two winners. Two sequences are better than one; the opportunity for comparison and
convergence is invaluable. Indeed, a real-world proof of the importance of access to both sets of data can
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the sequencing of the human 
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in
his Viewpoint (p. 1257), the knowledge that all of the genetic components of any process can be 
identified will give extraordinary new power to scientists. Because of this breakthrough, research 
can evolve from analyzing the effects of individual genes to a more integrated view that examines 
whole ensembles of genes as they interact to form a living human being. Several articles in this issue 
highlight how this approach is already beginning to revolutionize the way we look at human disease.
This has been a massive project, on a scale unparalleled in the history of biology, but of course
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark
announcement falls during the week of the anniversary of the birth of Charles Darwin. Darwin's
message that the survival of a species can depend on its ability to evolve in the face of change is 
peculiarly pertinent to discussions that have gone on in the past year over access to the Celera data. 
(Full information regarding the agreements that were reached to make the data available can be 
found at www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to be flexible in
allowing data repositories other than the traditional GenBank, while insisting on access to all the 
data needed to verify conclusions. In this domain, change is everywhere: Commercial researchers 
are producing more and more potentially valuable sequences, yet (at least in the United States) 
laws governing databases provide scant protection against piracy. Had the Celera data been kept
secret, it would have been a serious loss to the scientific community. We hope that our adaptability in
the face of change will enable other proprietary data to be published after peer review, in a way that
satisfies our continuing commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully watched, has created 
new challenges for the scientific venture. Science is proud to have played a role in bringing this 
discovery onto the public stage. It is literally true that this is a historic moment for the scientific
endeavor. The human genome has been called the Book of Life. Rather, it is a library in which, with
rules that encourage exploration and reward creativity, we can find many of the books that will
help define us and our place in the great tapestry of life.

Barbara R. Jasny and Donald Kennedy 



Query= SEQ ID NO:1

(975 letters) 



                                                   Score     E
Sequences producing significant alignments:        (bits)  Value

AC116156.3.1.176597                                 1917    0.0
AC109341.7.1.202761                                 1917    0.0

>AC116156.3.1.176597
          Length = 176597

 Score = 1917 bits (967), Expect = 0.0
 Identities = 973/975 (99%)
 Strand = Plus / Minus



Query: 1      atgaatcatatgtctgcatctctcaaaatctccaatagctccaaattccaggtctctgag 60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 149220 atgaatcatatgtctgcatctctcaaaatctccaatagctccaaattccaggtctctgag 149161

Query: 61     ttcatcctgctgggattcccgggcattcacagctggcaacactggctatctctgcccctg 120
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 149160 ttcatcctgctgggattcccgggcattcacagctggcaacactggctatctctgcccctg 149101

Query: 121    gcactactgtatctctcagcacttgctgcaaacaccctcatcctcatcatcatctggcag 180
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 149100 gcactactgtatctctcagcacttgctgcaaacaccctcatcctcatcatcatctggcag 149041

Query: 181    aacccttctttacagcagcccatgtatattttccttggcatcctctgtatggtagacatg 240
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 149040 aacccttctttacagcagcccatgtatattttccttggcatcctctgtatggtagacatg 148981

Query: 241    ggtctggccactactatcatccctaagatcctggccatcttctggtttgatgccaaggtt 300
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148980 ggtctggccactactatcatccctaagatcctggccatcttctggtttgatgccaaggtt 148921

Query: 301    attagcctccctgagcgctttgctcagatttatgccattcacttctttgtgggcatggag 360
              ||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148920 attagcctccctgagtgctttgctcagatttatgccattcacttctttgtgggcatggag 148861

Query: 361    tctggtatcctactctgcatggcttttgatagatatgtggctatttgtcaccctcttcgc 420
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148860 tctggtatcctactctgcatggcttttgatagatatgtggctatttgtcaccctcttcgc 148801

Query: 421    tatccatcaattgtcaccagttccttaatcttaaaagctaccctgttcatggtgctgaga 480
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148800 tatccatcaattgtcaccagttccttaatcttaaaagctaccctgttcatggtgctgaga 148741

Query: 481    aatggcttatttgtcactccagtgcctgtgcttgcagcacagcgtgattattgctccaag 540
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148740 aatggcttatttgtcactccagtgcctgtgcttgcagcacagcgtgattattgctccaag 148681

Query: 541    aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 600
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148680 aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 148621

Query: 601    aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 660
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148620 aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 148561

Query: 661    agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 720
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148560 agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 148501

Query: 721    gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 780
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148500 gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 148441

Query: 781    tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 840
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148440 tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 148381

Query: 841    ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttac 900
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148380 ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttat 148321

Query: 901    gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 960
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148320 gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 148261

Query: 961    gaaataagatcttag 975
              |||||||||||||||
Sbjct: 148260 gaaataagatcttag 148246



>AC109341.7.1.202761
          Length = 202761

 Score = 1917 bits (967), Expect = 0.0
 Identities = 973/975 (99%)
 Strand = Plus / Minus



Query: 1     atgaatcatatgtctgcatctctcaaaatctccaatagctccaaattccaggtctctgag 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59592 atgaatcatatgtctgcatctctcaaaatctccaatagctccaaattccaggtctctgag 59533

Query: 61    ttcatcctgctgggattcccgggcattcacagctggcaacactggctatctctgcccctg 120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59532 ttcatcctgctgggattcccgggcattcacagctggcaacactggctatctctgcccctg 59473

Query: 121   gcactactgtatctctcagcacttgctgcaaacaccctcatcctcatcatcatctggcag 180
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59472 gcactactgtatctctcagcacttgctgcaaacaccctcatcctcatcatcatctggcag 59413

Query: 181   aacccttctttacagcagcccatgtatattttccttggcatcctctgtatggtagacatg 240
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59412 aacccttctttacagcagcccatgtatattttccttggcatcctctgtatggtagacatg 59353

Query: 241   ggtctggccactactatcatccctaagatcctggccatcttctggtttgatgccaaggtt 300
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59352 ggtctggccactactatcatccctaagatcctggccatcttctggtttgatgccaaggtt 59293

Query: 301   attagcctccctgagcgctttgctcagatttatgccattcacttctttgtgggcatggag 360
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59292 attagcctccctgagcgctttgctcagatttatgccattcacttctttgtgggcatggag 59233

Query: 361   tctggtatcctactctgcatggcttttgatagatatgtggctatttgtcaccctcttcgc 420
             ||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59232 tctggtatcctcctctgcatggcttttgatagatatgtggctatttgtcaccctcttcgc 59173

Query: 421   tatccatcaattgtcaccagttccttaatcttaaaagctaccctgttcatggtgctgaga 480
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59172 tatccatcaattgtcaccagttccttaatcttaaaagctaccctgttcatggtgctgaga 59113

Query: 481   aatggcttatttgtcactccagtgcctgtgcttgcagcacagcgtgattattgctccaag 540
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59112 aatggcttatttgtcactccagtgcctgtgcttgcagcacagcgtgattattgctccaag 59053

Query: 541   aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 600
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59052 aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 58993

Query: 601   aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 660
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58992 aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 58933

Query: 661   agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 720
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58932 agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 58873

Query: 721   gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 780
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58872 gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 58813



Query: 
Sbjct: 



781 tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 840 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiiiiiiiiiii 

58812 tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 58753 



Query: 841 ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttac 900 

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIillllllllllll 

Sbjct: 58752 ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttat 58693 
Query: 901 gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 960 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 e 1 1 1 m 1 1 1 1 1 1 1 m 1 1 1 i 1 1 [ 1 1 1 1 1 1 1 1 1 r I! 1 1 1 1 1 1 1 1 1 1 1 . 

Sbjct : 58692 gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 58633 



Query: 961 
Sbjct: 58632 



gaaataagatcttag 975 

MM! MM llllll 

gaaataagatcttag 58618 
APPEAL BRIEF

Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent 
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on May 22, 2003. 
The Notice of Appeal was timely submitted on August 22, 2003, and was received in the Patent and 
Trademark Office ("the Office") on August 28, 2003. This Appeal Brief is timely submitted in light of the 
concurrently filed Petition for an Extension of Time of two months to and including December 28, 2003,
which falls on a Sunday and is therefore extended until Monday, December 29, 2003 under 
37 C.F.R. § 1.7, and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(2) from 
Appellants' Representatives' deposit account. The Commissioner is also authorized to charge the fee for 
filing this Appeal Brief ($165.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics 
Incorporated Deposit Account No. 50-0892. 

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the 
extension of time are due in connection with this Appeal Brief. However, should any additional fees under 
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated 
Deposit Account No. 50-0892. 

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest
Place, The Woodlands, Texas, 77381.

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences that will directly affect or be directly 
affected by or have a bearing on the Board's decision in the pending appeal. 







III. STATUS OF THE CLAIMS 

The present application was filed on July 26, 2001, claiming the benefit of U.S. Provisional 
Application Number 60/221,012, which was filed on July 27, 2000, and included original claims 1 and 2.
A First Official Action on the merits ("the First Action") was issued on October 1, 2002, in which claims 1
and 2 were rejected under 35 U.S.C. § 101 as allegedly lacking a patentable utility, claims 1 and 2 were
rejected under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the
alleged lack of patentable utility, claim 1 was rejected under 35 U.S.C. § 112, second paragraph, as
allegedly indefinite, and claim 1 was rejected under 35 U.S.C. § 102(a) as allegedly anticipated by
Bellenson et al. (WO 01/27158; "Bellenson"). In a response to the First Official Action submitted to the
Office on March 3, 2003 ("Response to the First Action"), Appellants amended claims 1 and 2, added 
new claims 3-5, and addressed the various rejections of claims 1 and 2. 

A Second and Final Official Action ("the Final Action") was issued on May 22, 2003, indicating 
that the rejections of claim 1 under 35 U.S.C. § 112, second paragraph, as allegedly indefinite, and claim 1
under 35 U.S.C. § 102(a) as allegedly anticipated by Bellenson, had been overcome by the amendments
and remarks submitted in the Response to the First Action, but maintaining the rejections of claims 1 and 2
(and newly added claims 3-5) under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and under
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of
patentable utility. In a response to the Final Action submitted to the Office on August 22, 2003 ("Response
to the Final Action"), Appellants again addressed the rejections of claims 1-5.

An Advisory Action ("the Advisory Action") was mailed on October 10, 2003, maintaining the 
rejections of claims 1-5 under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and under 
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of
patentable utility. Therefore, claims 1-5 are the subject of this appeal. A copy of the appealed claims is
included below in the Appendix (Section IX).

IV. STATUS OF THE AMENDMENTS 

As no amendments subsequent to the Final Action have been filed, Appellants believe that no
outstanding amendments exist.



V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human 
polynucleotide sequences that encode a novel G protein-coupled receptor that spans the cellular membrane 
and is involved in signal transduction after ligand binding, and that has structural motifs found in the seven 
transmembrane domain (7TM) receptor family (specification at page 2, lines 9-13, and at page 4, 
lines 20-23). 

The presently claimed polynucleotide sequences were compiled from cDNA clones from human 
adipose and testis cDNA libraries (specification at page 7, lines 13-14). Two coding single nucleotide 
polymorphisms were identified in the claimed sequence - specifically, a T/G polymorphism at position 233 
of SEQ ID NO:1, which can lead to a valine or glycine residue at amino acid position 78 of SEQ ID NO:2,
and a C/T polymorphism at position 316 of SEQ ID NO:1, which can lead to an arginine or cysteine
residue at amino acid position 106 of SEQ ID NO:2 (specification at page 7, lines 21-30). 
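
As a rough arithmetic check on the nucleotide and amino acid positions cited above, the following sketch maps a nucleotide position in a coding sequence to its codon number and to its offset within that codon, then translates the two alleles. It assumes that SEQ ID NO:1 is numbered from the first base of the start codon and that the standard genetic code applies. The function names are hypothetical, and the reference codons (gta and cgc) are assumptions read from the aligned query sequence in the exhibits rather than from the specification itself.

```python
# Sketch only: maps a 1-based nucleotide position in a coding sequence to its
# codon, assuming numbering starts at the first base of the start codon.

# Partial standard genetic code: only the codons needed for this example.
CODON_TABLE = {"gta": "Val", "gga": "Gly", "cgc": "Arg", "tgc": "Cys"}


def codon_location(nt_pos):
    """Return (1-based codon number, 0-based offset within that codon)."""
    return (nt_pos - 1) // 3 + 1, (nt_pos - 1) % 3


def substitute(codon, offset, base):
    """Return the codon with the base at `offset` replaced by `base`."""
    return codon[:offset] + base + codon[offset + 1:]


def describe_snp(nt_pos, ref_codon, alt_base):
    """Return (codon number, reference residue, alternate residue)."""
    codon_no, offset = codon_location(nt_pos)
    alt_codon = substitute(ref_codon, offset, alt_base)
    return codon_no, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


# Reference codons are assumptions read from the aligned query sequence.
print(describe_snp(233, "gta", "g"))  # (78, 'Val', 'Gly')
print(describe_snp(316, "cgc", "t"))  # (106, 'Arg', 'Cys')
```

Under these assumptions, position 233 falls on the second base of codon 78 and position 316 on the first base of codon 106, which is consistent with the valine/glycine and arginine/cysteine alternatives described above.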

The specification details a number of uses for the presently claimed polynucleotide sequences, 
including in diagnostic assays such as forensic analysis (see, for example, the specification at page 14, 
lines 5-8), in assessing gene expression patterns, particularly using a high throughput "chip" format (see, 
for example, the specification at page 9, lines 15-17), and in mapping a unique gene to a particular 
chromosome (see, for example, the specification at page 3, lines 36-37). 

VI. ISSUES ON APPEAL 

1. Do claims 1-5 lack a patentable utility? 

2. Are claims 1-5 unusable by a skilled artisan due to a lack of patentable utility? 

VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first
paragraph, associated with the utility rejection, the claims will stand or fall together. 



VIII. ARGUMENT 

A. Do Claims 1-5 Lack a Patentable Utility? 

The Final Action first rejects claims 1-5 under 35 U.S.C. § 101, as allegedly lacking a patentable
utility due to not being supported by either a specific and substantial or a well-established utility. 

Appellants pointed out both in the Response to the First Action and the Response to the Final 
Action that the present nucleic acid sequences have utility in diagnostic assays, such as forensic analysis, 
as described in the specification as originally filed (see, for example, page 14, lines 5-8). As described in 
the specification on page 7, lines 21-30, the present sequences define two coding single nucleotide 
polymorphisms - specifically, a T/G polymorphism at position 233 of SEQ ID NO:1, which can lead to a
valine or glycine residue at amino acid position 78 of SEQ ID NO:2, and a C/T polymorphism at position
316 of SEQ ID NO:1, which can lead to an arginine or cysteine residue at amino acid position 106 of SEQ
ID NO:2. Such polymorphisms are the basis for forensic analysis, which does not require any
information at all about the ultimate biological function of the encoded protein and is undoubtedly a
"real world" utility. The presently claimed sequence must therefore in itself be useful.

Appellants respectfully point out that the presently described polymorphisms are useful in forensic 
analysis exactly as they were described in the specification as originally filed - specifically, to distinguish 
individual members of the human population from one another based simply on the presence or absence 
of one or more of the described polymorphisms. The skilled artisan would be able to use the presently 
described polymorphisms in forensic analysis exactly as they were described in the specification as 
originally filed, without any additional research. It is important to note that simply because the use of these 
polymorphic markers will necessarily provide additional information on the percentage of particular 
subpopulations that contain these polymorphic markers does not mean that additional research is needed 
in order for these markers as they are presently described in the instant specification to be used in forensic 
science. 

This is also not a case of a potential utility. Even in the worst case scenario, the described 
polymorphisms are each useful to distinguish 50% of the population (in other words, the marker being 
present in half of the population). Appellants point out that the ability of a polymorphic marker to
distinguish at least 50% of the population is an inherent feature of any polymorphic marker, and this feature
is well understood by those of skill in the art. Appellants note that as a matter of law, it is well settled that
a patent need not disclose what is well known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988).
Appellants respectfully point out that all that is required to support Appellants' assertion of utility is for the
skilled artisan to believe that the presently described polymorphic markers could be useful in forensic
analysis. The fact that forensic biologists use polymorphic markers such as those described by Appellants
every day provides more than ample support for the assertion that forensic biologists would also be able
to use the specific polymorphic markers described by Appellants in the same fashion. Therefore, the
presently claimed sequence clearly has a substantial and well-established utility.

The Examiner first questioned this asserted utility because there is no "precise information about 
the individual from which a sample under analysis was taken" (the Final Action at page 3). Appellants point 
out that this argument has absolutely no bearing on the assertion that the polymorphisms described by
Appellants can be used in forensic analysis. As detailed above, forensic analysis merely determines the
presence or absence of one or more particular polymorphic markers as a means of distinguishing between
individuals. As such, forensic analysis requires absolutely no "information
about the individual from which a sample under analysis was taken". Thus, the Examiner's argument in no
way supports the allegation that the present claims lack a patentable utility. 

The Examiner further questioned this asserted utility, stating "(i)t is well known in the art of
molecular biology that the nucleotide sequences encoding an amino acid sequence of any particular protein
will have inconsequential differences from individual to individual, as will the amino acid sequences encoded
thereby. This is why all humans are not all identical and why DNA fingerprinting works" (the Final Action
bridging pages 2 and 3). However, after this admission that the presently described polymorphic markers
have a well-established utility in forensic analysis, the Examiner states that this is not a specific utility
because "almost any cDNA can be employed as a forensic marker in some capacity" (the Final Action at
page 3). Appellants respectfully point out that this argument is flawed in a number of respects. First,
Appellants submit that the asserted forensic utility is specific precisely because it cannot be applied to just
any polynucleotide. In fact, the basis for forensic analysis is the fact that such polymorphic markers are not
present in all other nucleic acids, but are in fact specific and unique to only a certain subset of the population.
This fact is conceded by the Examiner's statement that "almost any cDNA can be employed as a forensic
marker". Second, until a polymorphic marker is actually described it cannot be used in forensic analysis.
Put another way, simply because there is a likelihood, even a significant likelihood, that a particular nucleic
acid sequence will contain a polymorphism and thus be useful in forensic analysis, until such a polymorphism
is actually identified and described, such a likelihood is meaningless. The Examiner appears to be
attempting to use the information presented for the first time by Appellants in the instant specification as
hindsight verification that the presently claimed sequence would be expected to have polymorphic markers.
Such hindsight analysis based on Appellants' discovery is completely improper. Third, the Examiner is
clearly confusing the requirement for a specific utility, which is the proper standard for utility under
35 U.S.C. § 101, with the requirement for a unique utility, which is clearly an improper standard. The fact
that other polymorphic markers have been identified in other genetic loci, or that the use of the presently
described polymorphic markers will provide additional information concerning the prevalence of these
markers in certain subpopulations, does not mean that use of the polymorphic markers identified by
Appellants in SEQ ID NO:1 in forensic analysis is not a specific utility. As clearly stated by the Federal
Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101 (Fed. Cir. 1991; "Carl Zeiss"):

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 
1984) 

In other words, just because other (possibly better) polymorphic markers from the human genome have 
been described, or that additional information about the presently described polymorphic markers can be 
gained through the use of these markers, does not establish that the presently described polymorphic 
markers lack a specific utility. Furthermore, the requirement for a unique utility is clearly not the standard 
adopted by the Patent and Trademark Office. If every invention were required to have a unique utility, the 
Patent and Trademark Office would no longer be issuing patents on batteries, automobile tires, golf balls, 
golf clubs, and treatments for a variety of human diseases, such as cancer, just to name a few particular 
examples, because the utility of each of these compositions is applicable to the broad class in which each 
of these compositions falls: all batteries have the same utility, specifically to provide electrical power; all
automobile tires have the same utility, specifically for use on automobiles; all golf balls and golf clubs have 
the same utility, specifically for use in the game of golf; and all cancer treatments have the same utility, 
specifically, to treat cancer. However, only the briefest perusal of virtually any issue of the Official Gazette 
provides numerous examples of patents being granted on each of the above compositions nearly every 
week. Furthermore, if a composition needed to be unique to be patented, the entire class and subclass
system would be an exercise in futility, as the class and subclass system serves solely to group such common
inventions, which would not be required if each invention needed to have a unique utility. In view of the 
above standards and "common sense" analysis, there can be little question that the present sequence clearly 
meets the requirements of 35 U.S.C. § 101. 

Appellants pointed out in the Response to the Final Action that the holding in the Carl Zeiss case 
is mandatory legal authority that essentially controls the outcome of the present case. This case, and 
particularly the cited quote, directly rebuts the Examiner's argument, which is presumably why the 
Examiner failed to address the holding of Carl Zeiss in the Final Action, and continues to avoid addressing 
Carl Zeiss in the Advisory Action. Instead of addressing Appellants' arguments, the Examiner merely 
rehashes the standard irrelevant arguments concerning general utility - "that any purified compound having 
a known structure could be employed as an analytical standard in such processes as nuclear magnetic 
resonance (NMR), infrared spectroscopy (IR), and mass spectroscopy as well as in polyacrylamide gel 
electrophoresis (PAGE), high performance liquid chromatography (HPLC) and gas chromatography", and
that "any item having a constant mass within an acceptable range can be employed to calibrate a produce
scale in a grocery store" (the Final Action bridging pages 3 and 4). These staid arguments are flawed in
at least two critical respects. First, as pointed out by Appellants above, the admission on the record by
the Examiner that "almost any cDNA can be employed as a forensic marker in some capacity" (the Final
Action at page 3, emphasis added), points to the fact that not all nucleic acids have utility in forensic 
analysis. Thus, utility of nucleic acid sequences that contain defined polymorphic markers in forensic 
analysis is not a general utility. Second, the reason that such utilities as those listed by the Examiner are not 
specific is because these general utilities are applicable to a large number of unrelated compositions. Use
as a calibration standard for a "produce scale" is a utility that is applicable to any composition, no matter
how unrelated, that has mass. In other words, a metal block, an automobile, an elephant, or a nucleic acid 
molecule containing a polymorphism could be used to calibrate a produce scale, which is why use as a 
calibration standard for a produce scale is not a specific utility. However, a metal block, an automobile, 
or an elephant cannot be used in human forensic analysis. In fact, only nucleic acids, and specifically those 
human nucleic acids that contain a defined polymorphic marker, can be so used. Thus, these arguments 
also fail to support the Examiner's position. 

Appellants respectfully point out that these arguments only serve to highlight the Examiner's general
lack of understanding of forensic analysis. As repeatedly pointed out by Appellants, forensic analysis does 
not require any knowledge about any function of the expressed polynucleotide, or a correlation between 
the presence of any of these polymorphisms and the effect of the presence of any of these polymorphisms 
on the risk of any disease or disorder. Forensic analysis is used to distinguish individual members of the 
human population from one another based simply on the presence or absence of one or more of the 
described polymorphisms. No more and no less is required. No knowledge about the function of the 
encoded protein is required. No nexus between the polymorphic markers and a specific disease or 
disorder is required. The polymorphic markers described by Appellants do not need to be the best 
polymorphic markers, or the only polymorphic markers - they merely need to function as polymorphic 
markers, which is clearly the case. The present polymorphic markers clearly have utility in forensic analysis, 
and, thus, the claims meet the requirements of 35 U.S.C. § 101. 

Furthermore, Appellants pointed out in the Response to the Final Action that, as the presently described
polymorphisms are a part of the family of polymorphisms that have a well-established utility, the Federal
Circuit's holding in In re Brana (34 USPQ2d 1436 (Fed. Cir. 1995); "Brana") is directly on point. In
Brana, the Federal Circuit admonished the Patent and Trademark Office for confusing "the requirements
under the law for obtaining a patent with the requirements for obtaining government approval to market a
particular drug for human consumption". Brana at 1442. The Federal Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office
examination practice and policy. The question is, with regard to pharmaceutical inventions,
what must the applicant provide regarding the practical utility or usefulness of the invention
for which patent protection is sought. This is not a new issue; it is one which we would
have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph.
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C.
§§ 101 and 112, first paragraph. The Federal Circuit concluded:

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of 
pharmaceutical inventions, necessarily includes the expectation of further research and 
development. The stage at which an invention in this field becomes useful is well before
it is ready to be administered to humans. Were we to require Phase II testing in order to 
prove utility, the associated costs would prevent many companies from obtaining patent 
protection on promising new inventions, thereby eliminating an incentive to pursue, through 
research and development, potential cures in many crucial areas such as the treatment of 
cancer. 

Brana at 1442-1443, citations omitted, emphasis added. As set forth above, the present polymorphisms
are useful in forensic analysis as described in the specification as originally filed, without the need for any
further research. As discussed above, even if the use of these polymorphic markers provided additional
information on the percentage of particular subpopulations that contain these polymorphic markers, this
would not mean that "additional research" is needed in order for these markers as they are presently
described in the instant specification to be of use to forensic science. As stated above, using the
polymorphic marker as described in the specification as originally filed can definitely distinguish members
of a population from one another. However, even if, arguendo, further research might be required in
certain aspects of the present invention, this does not preclude a finding that the invention has utility, as set
forth by the Federal Circuit's holding in Brana, which clearly states, as highlighted in the quote above, that
"pharmaceutical inventions, necessarily includes the expectation of further research and development"
(Brana at 1442-1443, emphasis added). In assessing the question of whether undue experimentation
would be required in order to practice the claimed invention, the key term is "undue", not
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (CCPA 1976). The need for some
experimentation does not render the claimed invention unpatentable. Indeed, a considerable amount of
experimentation may be permissible if such experimentation is routinely practiced in the art. In re Angstadt
and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 18 USPQ2d 1016 (Fed. Cir.
1991). Again, as a matter of law, it is well settled that a patent need not disclose what is well known in the
art (In re Wands, supra).

Appellants respectfully point out that the Examiner has provided absolutely no evidence of record
that would serve to show that an artisan skilled in the art of forensic analysis would doubt Appellants'
asserted utility. As set forth by Appellants in the Response to the Final Action, it has been clearly
established that a statement of utility in a specification must be accepted absent reasons why one skilled
in the art would have reason to doubt the objective truth of such statement. In re Langer, 503 F.2d 1380,
1391, 183 USPQ 288, 297 (CCPA 1974; "Langer"); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ
367, 370 (CCPA 1971). As set forth in Langer:

As a matter of Patent Office practice, a specification which contains a disclosure of utility 
which corresponds in scope to the subject matter sought to be patented must be taken as 
sufficient to satisfy the utility requirement of § 101 for the entire claimed subject matter 
unless there is a reason for one skilled in the art to question the objective truth of the 
statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary 
skill in the art" (MPEP, Eighth Edition at 2100-40, emphasis added). Thus, absent such evidence from 
the Examiner concerning the use of the presently described polymorphisms in forensic analysis, the present 
claims clearly meet the requirements of 35 U.S.C. § 101. 

Additionally, in the Response to the Final Action, Appellants pointed out that the specification as
originally filed indicates that the presently claimed sequence is involved in "chemical communication"
(specification at page 1, line 28). Appellants further invited the Examiner's attention to the fact that a
sequence sharing 100% identity at the protein level over the entire length of the claimed sequence
is present in the leading scientific repository for biological sequence data (GenBank), and has been
annotated by third party scientists wholly unaffiliated with Appellants as "Homo sapiens similar to
olfactory receptor MOR40-13" (GenBank accession number XM_291808; alignment shown in
Exhibit A), and two sequences sharing nearly 100% identity at the protein level over the entire
length of the claimed sequence are present in the leading scientific repository for biological sequence data
(GenBank), and have been annotated by third party scientists wholly unaffiliated with Appellants as
"Homo sapiens similar to olfactory receptor MOR40-13" and "Homo sapiens gene for seven
transmembrane helix receptor" (GenBank accession numbers XM_362282 and AB065812; alignments
shown in Exhibit B). Furthermore, the murine olfactory receptor sequence referred to above
(MOR40-13) shares over 84% identity at the protein level and 91% similarity at the protein level
with the claimed sequence (GenBank accession numbers NM_146312 and AY073781; alignments shown
in Exhibit C). The legal test for utility simply involves an assessment of whether those skilled in the art
would find any of the utilities described for the invention to be credible or believable. Given these GenBank
annotations, there can be no question that those skilled in the art would clearly believe that Appellants'
sequence is an olfactory receptor protein, which is clearly involved in chemical communication. Thus, while
Appellants have provided evidence of record that conclusively establishes that those skilled in the art would
believe that the specifically claimed sequence encodes an olfactory receptor protein, the Examiner has
provided no evidence that directly establishes that the specifically claimed sequence does not encode an
olfactory receptor protein. Accordingly, the evidence of record compels a finding that the present invention
clearly meets the requirements of 35 U.S.C. § 101.

Furthermore, Appellants respectfully point out that the present case appears to directly track 
Example 10 of the Revised Interim Utility Guidelines Training Materials (Exhibit D), which only requires 
a similarity score greater than 95% to establish functional homology. Thus, the present utility rejection must 
fail as a matter of policy, as a matter of science, and as a matter of law. 

Appellants need only make one credible assertion of utility to meet the requirements of 
35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); In re Gottlieb, 140 USPQ 665 
(CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); Hoffman v. Klaus, 9 USPQ2d 1657 
(Bd. Pat. App. & Inter. 1988)), and thus the question of the utility of the presently claimed invention should 
be laid to rest. However, as admitted by the Examiner in the First Action, the present application describes 
a novel G-protein coupled receptor. Of the pharmaceutical products currently being marketed by the entire
industry, 60% of these drugs target G-protein coupled receptors (Gurrath, 2001, Curr. Med. Chem.
8:1605-1648; Exhibit E). Given that more than half of the currently marketed drugs target proteins that
are structurally (7TM proteins) and functionally (G-protein interaction) related to the presently described 
sequences, a preponderance of the evidence clearly weighs in favor of Appellants' assertion that the skilled 
artisan would readily recognize that the presently described sequences have a specific (the claimed GPCR 
proteins are encoded by a specific locus on the human genome), credible, and well-established utility, for 
example in tracking gene expression. The specification details on page 9, lines 15-17, that the present 
nucleotide sequences have utility in assessing gene expression patterns using high-throughput DNA chips. 
Such "DNA chips" clearly have utility, as evidenced by hundreds of issued U.S. Patents, as exemplified 
by U.S. Patent Nos. 5,445,934 (Exhibit F), 5,556,752 (Exhibit G), 5,744,305 (Exhibit H), 5,837,832 
(Exhibit I), 6,156,501 (Exhibit J) and 6,261,776 (Exhibit K). Evidence of the "real world" substantial 
utility of the present invention is further provided by the fact that there is an entire industry established based 
on the use of gene sequences or fragments thereof in a gene chip format. Perhaps the most notable gene 
chip company is Affymetrix. However, there are many companies that have, at one time or another, 
concentrated on the use of gene sequences or fragments, in gene chip and non-gene chip formats, for 
example: Gene Logic, ABI-Perkin-Elmer, HySeq and Incyte. In addition, one such company (Rosetta 
Inpharmatics) was viewed to have such "real world" value that it was acquired by a large pharmaceutical 
company (Merck) for significant sums of money (net equity value of the transaction was $620 million). The 
"real world" substantial industrial utility of gene sequences or fragments would, therefore, appear to be 
widespread and well established. Clearly, there can be no doubt that the skilled artisan would know how 
to use the presently claimed sequences (see Section VIII(B), below), strongly arguing that the claimed 
sequences have utility. Given the widespread utility of such "gene chip" methods using public domain gene 
sequence information, there can be little doubt that the use of the presently described novel sequences 
would have great utility in such DNA chip applications. As the present sequences are specific markers of 
the human genome (see below), and such specific markers are targets for the discovery of drugs that are 
associated with human disease, those of skill in the art would instantly recognize that the present nucleotide 
sequences would be ideal, novel candidates for assessing gene expression using such DNA chips. Clearly, 
compositions that enhance the utility of such DNA chips, such as the presently claimed nucleotide 
sequences, must in themselves be useful. Thus, the present claims clearly meet the requirements of 
35 U.S.C. § 101. 

Clearly, persons of skill in the art, as well as venture capitalists and investors, readily recognize the 
utility, both scientific and commercial, of genomic data in general, and specifically human genomic data. 
Billions of dollars have been invested in the human genome project, resulting in useful genomic data (see, 
e.g., Venter et al., 2001, Science 291:1304; Exhibit L). The results have been a stunning success as the 
utility of human genomic data has been widely recognized as a great gift to humanity (see, e.g., Jasny and 
Kennedy, 2001, Science 291:1153; Exhibit M). Clearly, the usefulness of human genomic data, such as 
the presently claimed nucleic acid molecules, is substantial and credible (worthy of billions of dollars and 
the creation of numerous companies focused on such information) and well-established (the utility of human 
genomic information has been clearly understood for many years). 

As yet a further example of the utility of the presently claimed polynucleotide, Appellants noted in 
the Response to the First Action that the present nucleotide sequence has a specific utility in "mapping a 
unique gene to a particular chromosome", as described in the specification at least at page 3, lines 36-37. 
This is evidenced by the fact that SEQ ID NO: 1 can be used to map SEQ ID NO: 1 to chromosome 11 
(present within two independent chromosome 11 clones; GenBank Accession Numbers AC116156 and 
AC109341; alignments and the first page from the GenBank reports are presented in Exhibit N). Clearly, 
the present polynucleotide provides exquisite specificity in localizing the specific region of human 
chromosome 11 that contains the gene encoding the given polynucleotide, a utility not shared by virtually 
any other nucleic acid sequences. In fact, it is this specificity that makes this particular sequence so useful. 
Early gene mapping techniques relied on methods such as Giemsa staining to identify regions of 
chromosomes. However, such techniques produced genetic maps with a resolution of only 5 to 10 
megabases, far too low to be of much help in identifying specific genes involved in disease. The skilled 
artisan readily appreciates the significant benefit afforded by markers that map a specific locus of the human 
genome, such as the present nucleic acid sequence. 

Appellants respectfully reminded the Examiner that only a minor percentage (2-4%) of the genome 
actually encodes exons, which in turn encode amino acid sequences. The presently claimed polynucleotide 
sequence provides biologically validated empirical data (e.g., showing which sequences are transcribed and 
polyadenylated) that specifically define that portion of the corresponding genomic locus that actually 
encodes exon sequence, as described above. Appellants respectfully submit that the practical scientific 
value of biologically validated, expressed and polyadenylated mRNA sequences is readily apparent to those 
skilled in the relevant biological and biochemical arts. For further evidence in support of the Appellants' 
position, the Board is requested to review, for example, section 3 of Venter et al. (supra at 
pp. 1317-1321, including Fig. 11 at pp. 1324-1325; see Exhibit L), which demonstrates the significance 
of expressed sequence information in the structural analysis of genomic data. The presently claimed 
polynucleotide sequence defines a biologically validated sequence that provides a unique and specific 
resource for mapping the genome essentially as described in the Venter et al. article. Thus, the present 
claims clearly meet the requirements of 35 U.S.C. § 101. 

The Examiner's main argument concerning these asserted utilities is that, once again, other nucleic 
acid sequences can be used in a similar fashion - "almost any cDNA can be . . . used as a chromosomal or 
tissue marker or in a gene chip for expression profiling" (the Final Action at page 3). Appellants once again 
point out that these arguments are completely rebuffed by the Federal Circuit's holding in Carl Zeiss, supra 
("[A]n invention need not be the best or only way to accomplish a certain result"). 

Regarding the utility requirements under 35 U.S.C. § 101, the Federal Circuit has clearly stated 
"(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of providing 
some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 1700 
(Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal Circuit 
has stated that "(t)o violate § 101 the claimed device must be totally incapable of achieving a useful result." 
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed. Cir. 
1992), emphasis added. Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); "Cross") 
states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748, 
emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under the sun 
that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group Inc., 
149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in 
Diamond v. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S., 1980)). Thus, based on the relevant 
case law, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Finally, while Appellants are well aware of the new Utility Guidelines set forth by the USPTO, 
Appellants respectfully point out that the current rules and regulations regarding the examination of patent 
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set 
forth in 37 C.F.R., not the Manual of Patent Examination Procedure or particular guidelines for patent 
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to 
interpret these laws and rules. Appellants are unaware of any significant recent changes in either 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit 
that is in keeping with the new Utility Guidelines set forth by the USPTO. This is underscored by numerous 
patents that have been issued over the years that claim nucleic acid fragments that do not comply with the 
new Utility Guidelines. As examples of such issued U.S. Patents, the Board is invited to review U.S. Patent 
Nos. 5,817,479 (Exhibit O), 5,654,173 (Exhibit P), and 5,552,281 (Exhibit Q; each of which claims 
short polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit R; which includes no 
working examples), none of which contain examples of the "real-world" utilities that the Examiner seems 
to be requiring. Additionally, the Office has recently issued U.S. Patent 6,043,052 (Exhibit S), which 
concerns an "orphan" G-Protein coupled receptor identified based only on homology to the orphan 
receptor GPR25, similar to the situation with Appellants' currently claimed sequence. Importantly, this 
issued patent also contains no examples of the "real world" utilities seemingly required in the present case. 
As issued U.S. Patents are presumed to meet all of the requirements for patentability, including 
35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below), Appellants submit that the 
present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Appellants understand 
that each application is examined on its own merits, Appellants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit, 
since the issuance of these patents that render the subject matter claimed in these patents, which is similar 
to the subject matter in question in the present application, as suddenly non-statutory or failing to meet the 
requirements of 35 U.S.C. § 101. Thus, holding Appellants to a different standard of utility would be 
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1-5 under 
35 U.S.C. § 101 must be overruled. 

B. Are Claims 1-5 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-5 under 35 U.S.C. § 112, first paragraph, since allegedly 
one skilled in the art would not know how to use the invention, as the invention allegedly is not supported 
by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have 
determined that the utility requirement of Section 101 and the how to use requirement of Section 112, first 
paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; In re 
Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 439 F.2d 
1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-5 have been shown 
to have "a specific, substantial, and credible utility", as detailed in Section VIII(A) above, the present 
rejection of claims 1-5 under 35 U.S.C. § 112, first paragraph, cannot stand. 

Appellants therefore submit that the rejection of claims 1-5 under 35 U.S.C. § 112, first paragraph, 
must be overruled. 






IX. APPENDIX 

The claims involved in this appeal are as follows: 

1. (Previously Presented) An isolated nucleic acid molecule comprising a nucleotide sequence that 
encodes the amino acid sequence of SEQ ID NO:2. 

2. (Previously Presented) An isolated nucleic acid expression vector comprising a nucleotide 
sequence encoding the amino acid sequence of SEQ ID NO: 2, said vector having the property of being 
capable of expressing the amino acid sequence of SEQ ID NO: 2 when present in a suitable host cell. 

3. (Previously Presented) The isolated nucleic acid molecule of claim 1, wherein said nucleotide 
sequence comprises the sequence of SEQ ID NO:1. 

4. (Previously Presented) The isolated nucleic acid expression vector of claim 2, wherein said 
nucleotide sequence comprises the sequence of SEQ ID NO: 1. 

5. (Previously Presented) A host cell comprising the expression vector of claim 2. 

X. CONCLUSION 

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's conclusion 
that claims 1-5 lack a patentable utility and are unusable by the skilled artisan due to a lack of patentable 
utility is unwarranted. It is therefore requested that the Board overturn the Final Action's rejections. 

Respectfully submitted, 



December 29, 2003 

Date David W. Hibler Reg. No. 41,071 

Agent For Appellants 

LEXICON GENETICS INCORPORATED 
(281)863-3399 



Customer # 24231 

>XM_291808 ACCESSION: XM_291808 NID: gi 29743652 ref XM_291808.1 

Homo sapiens similar to olfactory receptor MOR40-13 [Mus 
musculus] (LOC340982), mRNA 
Length = 975 

Score = 644 bits (1643), Expect = 0.0 

Identities = 324/324 (100%), Positives = 324/324 (100%) 
Frame = +1 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 180 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 
Sbjct: 181 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 360 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 361 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 540 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 541 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 720 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 721 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 900 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 901 ALQTKELRAAFQKVLFALTKEIRS 972 



>XM_062282 ACCESSION:XM_062282 NID: gi 29746563 ref XM_062282.7 

Homo sapiens similar to olfactory receptor MOR40-13 [Mus 
musculus] (LOC120806), mRNA 
Length = 975 

Score = 641 bits (1635), Expect = 0.0 

Identities = 323/324 (99%), Positives = 323/324 (99%) 

Frame = +1 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 180 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPE FAQIYAIHFFVGME 
Sbjct: 181 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPECFAQIYAIHFFVGME 360 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 361 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 540 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 541 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 720 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 721 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 900 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 901 ALQTKELRAAFQKVLFALTKEIRS 972 
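
The "Identities" figure in the report above can be checked by hand. The following is a minimal sketch, not part of the original exhibit, that counts matching residues over the one 60-residue block (query positions 61-120) where the query and subject differ. The function name `percent_identity` is hypothetical, and the assumption that the report truncates the percentage to a whole number is ours.

```python
def percent_identity(query: str, subject: str) -> tuple[int, int, float]:
    """Return (matches, aligned length, percent identity) for two
    equal-length, ungapped aligned segments."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query), 100.0 * matches / len(query)


# Query and subject residues 61-120, copied from the alignment above.
QUERY_61_120 = "NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME"
SBJCT_61_120 = "NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPECFAQIYAIHFFVGME"

block_matches, block_length, block_percent = percent_identity(
    QUERY_61_120, SBJCT_61_120
)

# Whole-alignment figure as reported: 323 identical residues out of 324.
# Assumption: the report truncates, rather than rounds, the percentage.
reported_percent = int(100 * 323 / 324)
```

Under these assumptions the single R/C difference in this block accounts for the one non-identical residue out of 324.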



>AB065812 ACCESSION: AB065812 NID: gi 21928889 dbj AB065812.1 Homo 
sapiens gene for seven transmembrane helix receptor, 
complete cds, isolate :CBRC7TM_375 
Length = 1366 

Score = 641 bits (1635), Expect = 0.0 

Identities = 323/324 (99%), Positives = 323/324 (99%) 

Frame = +3 

Query: 1 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 60 

MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 
Sbjct: 192 MNHMSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQ 371 

Query: 61 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGME 120 

NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPE FAQIYAIHFFVGME 
Sbjct: 372 NPSLQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPECFAQIYAIHFFVGME 551 

Query: 121 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 180 

SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 
Sbjct: 552 SGILLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSK 731 

Query: 181 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 240 

NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 
Sbjct: 732 NEIEHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSA 911 

Query: 241 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 300 

EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 
Sbjct: 912 EAAAKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVY 1091 

Query: 301 ALQTKELRAAFQKVLFALTKEIRS 324 

ALQTKELRAAFQKVLFALTKEIRS 
Sbjct: 1092 ALQTKELRAAFQKVLFALTKEIRS 1163 



>NM_146312 ACCESSION:NM_146312 NID: gi 22129666 ref NM_146312.1 Mus 
musculus olfactory receptor MOR40-13 (MOR40-13), mRNA 
Length = 960 

Score = 532 bits (1355), Expect = e-149 

Identities = 264/312 (84%), Positives = 286/312 (91%) 

Frame = +1 

Query: 4 MSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQNPS 63 

MSASLK NSSK QVSEFILLGFPGIHSWQHWLSLP LLYLSA+ N LILIII Q+PS 
Sbjct: 1 MSASLKAFNSSKSQVSEFILLGFPGIHSWQHWLSLPFTLLYLSAIGTNVLILIIICQDPS 180 

Query: 64 LQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGMESGI 123 

L+QPMY+FLGIL +VDMGLATTI+PKILAIFWFDAKVISLPE FAQIYAIH FVGMESGI 
Sbjct: 181 LKQPMYLFLGILSVVDMGLATTIMPKILAIFWFDAKVISLPECFAQIYAIHCFVGMESGI 360 

Query: 124 LLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSKNEI 183 

LCMAFDRYVAIC+PLRY SI+T+SLILKATLFMVLRNGL V PVPVLAAQR+YCS+NEI 
Sbjct: 361 FLCMAFDRYVAICYPLRYSSIITNSLILKATLFMVLRNGLCVIPVPVLAAQRNYCSRNEI 540 

Query: 184 EHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSAEAA 243 

+HCLCSNLGVTSLACDDRRPNSICQL+LAW+GMGSDL LIILSY LIL SVLRLNSAEA 
Sbjct: 541 DHCLCSNLGVTSLACDDRRPNSICQLILAWVGMGSDLGLIILSYTLILRSVLRLNSAEAV 720 

Query: 244 AKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVYALQ 303 

+KAL+TCSSHL LILFFYT+VVVISVTHL+E KATLIPVLLNV+HNI PPSLNP VYAL+ 
Sbjct: 721 SKALNTCSSHLILILFFYTVVVVISVTHLSETKATLIPVLLNVMHNITPPSLNPIVYALR 900 

Query: 304 TKELRAAFQKVL 315 

T++LR FQKVL 
Sbjct: 901 TRQLRQGFQKVL 936 



>AY073781 ACCESSION:AY073781 NID: gi 18480859 gb AY073781.1 Mus 

musculus olfactory receptor MOR40-13 gene, complete cds 
Length = 960 

Score = 532 bits (1355), Expect = e-149 

Identities = 264/312 (84%), Positives = 286/312 (91%) 



Frame = +1 

Query: 4 MSASLKISNSSKFQVSEFILLGFPGIHSWQHWLSLPLALLYLSALAANTLILIIIWQNPS 63 

MSASLK NSSK QVSEFILLGFPGIHSWQHWLSLP LLYLSA+ N LILIII Q+PS 
Sbjct: 1 MSASLKAFNSSKSQVSEFILLGFPGIHSWQHWLSLPFTLLYLSAIGTNVLILIIICQDPS 180 

Query: 64 LQQPMYIFLGILCMVDMGLATTIIPKILAIFWFDAKVISLPERFAQIYAIHFFVGMESGI 123 

L+QPMY+FLGIL +VDMGLATTI+PKILAIFWFDAKVISLPE FAQIYAIH FVGMESGI 
Sbjct: 181 LKQPMYLFLGILSVVDMGLATTIMPKILAIFWFDAKVISLPECFAQIYAIHCFVGMESGI 360 

Query: 124 LLCMAFDRYVAICHPLRYPSIVTSSLILKATLFMVLRNGLFVTPVPVLAAQRDYCSKNEI 183 

LCMAFDRYVAIC+PLRY SI+T+SLILKATLFMVLRNGL V PVPVLAAQR+YCS+NEI 
Sbjct: 361 FLCMAFDRYVAICYPLRYSSIITNSLILKATLFMVLRNGLCVIPVPVLAAQRNYCSRNEI 540 

Query: 184 EHCLCSNLGVTSLACDDRRPNSICQLVLAWLGMGSDLSLIILSYILILYSVLRLNSAEAA 243 

+HCLCSNLGVTSLACDDRRPNSICQL+LAW+GMGSDL LIILSY LIL SVLRLNSAEA 
Sbjct: 541 DHCLCSNLGVTSLACDDRRPNSICQLILAWVGMGSDLGLIILSYTLILRSVLRLNSAEAV 720 

Query: 244 AKALSTCSSHLTLILFFYTIVVVISVTHLTEMKATLIPVLLNVLHNIIPPSLNPTVYALQ 303 

+KAL+TCSSHL LILFFYT+VVVISVTHL+E KATLIPVLLNV+HNI PPSLNP VYAL+ 
Sbjct: 721 SKALNTCSSHLILILFFYTVVVVISVTHLSETKATLIPVLLNVMHNITPPSLNPIVYALR 900 

Query: 304 TKELRAAFQKVL 315 

T++LR FQKVL 
Sbjct: 901 TRQLRQGFQKVL 936 





characterize the protein. A starting material that can only be used to produce 
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment Encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were 
sequenced and open reading frames were identified. The specification 
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
Ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the 
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes. 

Note that if there is a well-established utility already associated with the 
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a well- 
established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 
101 rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should 
not be made. 

Example 11: Animals with Uncharacterized Human Genes 

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence nor presence of a specific protein has been identified with 
the disease condition. 

Abstract: Over the last decades distinct members of the G Protein-Coupled Receptor 
(GPCR) family emerged as prominent drug targets within pharmaceutical research, since 
approximately 60% of marketed prescription drugs act by selectively addressing 
representatives of that class of transmembrane signal transduction systems. It is 
noteworthy that the majority of GPCR-targeted drugs elicit their biological activity by 
selective agonism or antagonism of biogenic monoamine receptors, while the development 
status of peptide-binding GPCR-addressing compounds is still in its infancy. 
Exemplified on selected medicinal chemistry projects, this review will focus on the opportunities of 
therapeutic intervention into a broad spectrum of disease processes through agonizing or antagonizing the 
function of peptide-binding GPCRs. In this context, a brief overview of GPCR-mediated signal transduction 
pathways will be given in order to emphasize the biomedical relevance of a controlled modulation of receptor 
function. Current trends on lead finding and optimization strategies for peptide-binding GPCR-targeted low- 
molecular weight compounds will be highlighted on the basis of current research programs conducted in the 
fields of angiotensin II, endothelin, bradykinin, neurokinin, neuropeptide Y, LHRH, C5a antagonists and 
somatostatin agonists, respectively. Special emphasis will be laid on the elaboration and utilization of 
structural rationales on the potential drug candidates, thus facilitating more detailed insights into the 
underlying molecular recognition event. 



INTRODUCTION 

Current pharmaceutical research is going through a period 
of unprecedented change, since new revolutionizing 
techniques have been successfully implemented into the 
pharmaceutical discovery process. At the same time, 
pharmaceutical industry feels growing pressure to release 
more new chemical entities (NCEs) that evolve as highly 
selective drugs targeting therapeutic areas of unmet medical 
need and address novel mechanisms of action. These 
attributes clearly define an ideal set of preconditions for 
positioning a candidate with block buster potential onto the 
drug market [1-3]. The conceptual combination of automated 
combinatorial chemistry, multiple parallel synthesis with 
high-throughput screening has dramatically altered the 
process of lead finding in medicinal chemistry in that vast 
numbers of low molecular weight compounds can rapidly be 
screened against biological target systems [4]. This progress 
in medicinal chemistry is paralleled on the side of target 
identification and validation with the maturation of 
genomics, proteomics, and bioinformatics in pharmaceutical 
research [5]. Taken together, these novel methodologies are 
expected to facilitate and accelerate the overall drug discovery 
process significantly. 

However, the judicious choice of a disease relevant target 
is still one of the most crucial steps in initiating a drug 
discovery project, both in terms of novelty and uniqueness of 
the underlying therapeutic principle, as well as the 
competitor situation [2]. 

In this context, the superfamily of transmembrane G 
protein-coupled receptors (GPCRs) emerged as the most 
prominent class of qualified drug targets for pharmaceutical 
research and biomedical application [6]. Approximately 60% 
of all commercially available drugs work by selective 
modulation of distinct members of this target family [7]. 
Even though an estimated number of 1000 to 2000 GPCRs 
is expected to exist in the human genome [8], current 
GPCR-targeted therapeutic principles exploit a surprisingly 
small fraction of the GPCR family known today. A strong 
bias exists among the GPCR-targeted drugs in favour of the 
subclass of biogenic monoamine-stimulated GPCRs, i.e. the 
classical neurotransmitter-binding receptors [9,10]. 

This review will focus on the opportunity to further 
expand the spectrum of drug-targeted GPCRs onto the huge 
subclass of peptide-binding representatives of that target 
family. After a brief introduction on the basic principles of 
receptor structure and function, the chemically diverse set of 
endogenous ligands will be discussed with the aim to 
emphasize the relevance of peptide-binding GPCRs for 
modern drug discovery. 

The lead identification and optimization attempts 
discussed in this contribution are restricted on projects that 
are aimed to identify peptidomimetic or non-peptide agonists 
or antagonists. Numerous pharmaceutical research efforts 
conducted over the last two decades have clearly proven the 
relevance of an early pharmacokinetic profiling. 
Consequently, satisfactory metabolic stability and oral 
bioavailability demand a transfer of the peptide-encoded 
biological and structural information onto non-peptide, drug-like 
scaffolds in order to achieve the desired goal [11-13]. 

Classical attempts towards drugs selectively addressing 
peptide-binding GPCRs will be exemplified on the 
angiotensin II and endothelin receptor antagonists. In both 
areas, leads were identified by screening programs and further 
optimized by classical medicinal chemistry approaches to 
yield clinical candidates, some of which already entered the 
market. The classical approach of optimizing screening hits 
will further be introduced with medicinal chemistry 
programs aimed to identify active compounds for a 
modulation of the bradykinin, neurokinin, and NPY 
(neuropeptide Y) receptors. Since the area of peptide-binding 
GPCR compounds is still in its infancy, especially when 
compared to the situation of biogenic amine-binding receptor 
drugs, the actual state of the majority of projects discussed in 
this review is still in the preclinical or in early clinical 
phases. Apart from random lead finding attempts, structural 
rationales are more frequently used in recent times, 
precedented by studies on somatostatin, bradykinin, 
neurokinin, LHRH (luteinizing hormone-releasing hormone), 
and anaphylatoxin C5a receptor agonists and antagonists that 
will be discussed briefly. Structural rationales were mainly 
derived from an educated guess on the bioactive 
conformation of the endogenous peptide or protein ligand, 
thus offering the opportunity to follow an indirect drug 
design approach. 




GPCR SUPERFAMILY 

G protein-coupled receptors constitute the largest receptor 
family known today [8]. According to an analysis of the 
C. elegans genome [14], approximately 5% of the 19100 
nematode genes encode GPCRs with a family distribution 
profile that is reminiscent of that of mammalian GPCR 
genes. Extrapolation of these findings would suggest that up 
to 5000 distinct GPCR-encoding genes exist within the 
human genome (5% of an estimated 100000 genes). 
Currently, more than 800 distinct members of the GPCR 
superfamily have been cloned from various species, ranging 
from fungi over plants, yeast, slime mould, protozoa, 
metazoa to humans. Apart from the sensory olfactory 
receptors, approximately 150 human GPCRs have been 
cloned for which the endogenous ligands have also been 
identified. Further, more than 100 GPCRs are known with 
unidentified ligands and unknown physiological relevance, 
so-called orphan GPCRs, which undoubtedly represent a rich 
source of disease-relevant drug targets for future biomedical 
research [15-17]. 
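
The extrapolation above is a simple proportion. The following minimal sketch reproduces the arithmetic; the function name is hypothetical and the only inputs are the gene counts and the 5% fraction quoted in the text.

```python
def estimate_gpcr_genes(total_genes: int, gpcr_fraction: float) -> int:
    """Estimate the number of GPCR-encoding genes as a fixed fraction of a genome."""
    return round(total_genes * gpcr_fraction)


# Figures quoted in the text: ~5% of 19100 nematode genes, and the same
# fraction applied to an estimated 100000 human genes.
nematode_gpcrs = estimate_gpcr_genes(19_100, 0.05)
human_gpcrs = estimate_gpcr_genes(100_000, 0.05)
print(nematode_gpcrs, human_gpcrs)
```

The human figure is only as reliable as the assumed genome size and the assumption that the nematode fraction carries over unchanged.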



Structure and Function of GPCRs 

GPCRs belong to the class of integral plasma membrane 
proteins and share a common receptor protein topology 
throughout the entire family. The structure paradigm is a 
seven helix bundle that spans the cell membrane in an 
almost perpendicular orientation, thereby establishing a 
functional link between the exterior and the cytoplasm of the 




Fig. (1). Side-by-side stereo presentation of the Cα trace model of rhodopsin derived from various biophysical and bioinformatics 
studies. The helix bundle is shown in a side view, the extracellular compartment being on the top. For details see references [22-31]. 




cell [6,18-20]. The seven transmembrane sequence stretches 
can be identified by hydrophobicity analyses since they 
exhibit an increased hydrophobic signature in a 
corresponding hydrophobicity profile. From numerous 
biophysical and biochemical studies it is now generally 
accepted that GPCRs intercalate into the cell membrane with 
their N-terminus in the extracellular compartment, while the 
C-terminus reaches into the cytoplasm of the cell. The seven 
transmembrane helices (7TM domain) that constitute the 
central core domain of all GPCRs, are sequentially connected 
by extracellular and intracellular loops. Apart from variations 
in the primary structure, GPCRs differ in length of these 
loops, as well as in length and function of both N- and C- 
termini. The ACTH (adrenocorticotropic hormone) receptor 
is one of the smallest GPCRs known with 297 residues. 
Biogenic monoamine receptor sequences cover a size from 
approximately 350 to 600 residues, peptide receptor 
sequences are found between 400 and 750 residues, while the 
mGluRs (metabotropic glutamate receptor) mark the upper 
boundary consisting of roughly 1200 amino acid residues 
[21]. 
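
The hydrophobicity analysis mentioned above can be illustrated with a sliding-window hydropathy profile. This is a minimal sketch, not the method used in the cited studies: the Kyte-Doolittle scale, the 19-residue window and the 1.6 cut-off are assumptions chosen because they are common conventions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the text names no specific one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return (start, end) residue ranges, end exclusive, covered by
    consecutive windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(seq, window)
    segments, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            segments.append((start, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start, len(seq)))
    return segments


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "D" * 10 + "L" * 25 + "K" * 10
print(candidate_tm_segments(toy))
```

For a real receptor sequence one would expect seven such candidate stretches, although the reported boundaries depend strongly on the window and threshold chosen.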

Even though no high-resolution structure of any 
pharmaceutically relevant member of the GPCR superfamily 
has been determined by e.g. x-ray crystallography, 
low-resolution models derived from electron cryo-microscopy and 
electron diffraction of bovine, frog and squid rhodopsin reveal 
a detailed picture of the insertion mode of each helix within 
the context of the transmembrane helix bundle domain 
(Fig. (1)) [22-31]. 

From a functional point of view, GPCRs share a 
common property in that they work as transmembrane 
transducer systems by transferring an extracellular message 
across the cell membrane, thus allowing the affected tissue to 
respond to a broad range of signalling molecules [32-35]. 
Upon extracellular binding of the molecular stimulus, the 
central core domain (7TM domain) is believed to undergo a 
conformational change, thereby transmitting the extracellular 
binding event into the cytoplasm (Fig. (2)). The binding of 
a receptor agonist leads to an intracellular interaction of the 
receptor protein with its cognate heterotrimeric GDP-bound 
G protein. The agonist-promoted conformational change of 
the receptor protein followed by the cytoplasmic G protein 
coupling initiates the activation of intracellular effector 
systems by the G protein cycle (Fig. (2)). The coupling 
event catalyzes the exchange of GDP for GTP and the 
dissociation of the GTP-bound α subunit from the βγ 
heterodimer. Depending on the very nature of the G protein 
α subtype, different effector systems such as enzymes (e.g. 
adenylyl cyclase, phospholipase C) or ion channels are 
functionally modulated, which substantially amplifies the 
production of second messengers. The effector activation 
event is accompanied by a GTPase activity of the α subunit 
releasing inorganic phosphate. In its GDP-bound form, 
the α subunit regains high affinity for the βγ 
heterodimer, finally forming the GDP-bound heterotrimeric 
G protein again. The modulated concentration of second 
messengers elicits phosphorylation cascades across the 
cytoplasm to the nucleus, eventually activating the final 
physiological response of a cell to the original extracellular 
stimulus. Even though this functional paradigm accounts for 
all known GPCRs, this obvious convergence after the ligand 
binding event is diversified by the selective activation of 
only distinct types of G proteins, of which e.g. numerous 
different Gα subunits are known (Fig. (2)) [32-35]. 
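
The G protein cycle described above is a closed sequence of steps, which can be written down as a small cyclic list. This is only an illustrative sketch: the step wording paraphrases the paragraph, and the names `G_PROTEIN_CYCLE` and `walk_cycle` are hypothetical.

```python
from itertools import cycle, islice

# Steps of the G protein cycle in the order given in the text.
G_PROTEIN_CYCLE = (
    "GDP-bound heterotrimeric G protein (resting)",
    "agonist-activated receptor couples to the G protein",
    "GDP is exchanged for GTP on the alpha subunit",
    "GTP-bound alpha subunit dissociates from the beta-gamma heterodimer",
    "effector (e.g. adenylyl cyclase, phospholipase C, ion channel) is modulated",
    "alpha subunit hydrolyses GTP to GDP, releasing inorganic phosphate",
    "GDP-bound alpha subunit reassociates with the beta-gamma heterodimer",
)


def walk_cycle(n_steps, start=0):
    """Return n_steps consecutive steps, wrapping around at the end of the cycle."""
    return list(islice(cycle(G_PROTEIN_CYCLE), start, start + n_steps))


for step in walk_cycle(len(G_PROTEIN_CYCLE) + 1):
    print(step)
```

Walking one step more than the cycle length returns to the resting heterotrimer, which mirrors the statement that the GDP-bound heterotrimeric G protein is finally formed again.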




Fig. (2). Schematic representation of the ligand-GPCR interaction mediated G protein cycle. 

In order to fully characterize the mechanism of action of 
GPCRs, a thermodynamic "eight-state-model" has been 
developed as a mechanistic hypothesis describing the 
macroscopic properties of transitions among distinct 
conformational states (Fig. (3)) [36]. The simplest way to 
describe the ligand-induced receptor activation event is a 
ternary complex model (A) that defines two distinct affinity 
states of the receptor for agonist binding, notably the free 
receptor (Rec) and the G protein-bound form (G·Rec) (Fig. 
(3)A). Agonists would display high affinity to the G 
protein-associated receptor, while antagonists would exhibit only 
low affinity towards the complex. With the discovery that 
GPCRs can activate G proteins in the absence of any 
agonist, the simple ternary complex model required an 
extension. To account for the existence of such constitutively 
active GPCRs, a receptor activation step in the unliganded 
form was introduced (Fig. (3)B). This receptor isomerization 
hypothesis resulted in a "six-state-model" in which the 
activated receptor (Rec*) is capable of signalling in both the 
G protein-associated form (G·Rec*) and in the ternary 
complex (G·Rec*·Lig). The problem with that receptor 
activation-extended ternary complex model is that the G 
protein only binds to the receptor in its activated form Rec*. 
Experimental evidence clearly suggests that G proteins do 
also bind to the resting state (Rec) without subsequent G 
protein activation. To account for these findings and to refer 
to the microscopic reversibility principle of thermodynamics, 
an "eight-state-model" was proposed in which the receptor 
protein can undergo three distinct processes, namely (i) 
ligand binding, (ii) receptor isomerization and (iii) G 
protein binding (Fig. (3)C). Agonists can bind to four 
different receptor states, clearly favouring the activated states 
generated either by receptor isomerization or by G protein 
association. Inverse agonists would prefer to bind the 
non-activated ground state (Rec), while partial agonists show 
affinity to both receptor states but still cause receptor 
activation. In the thermodynamic "eight-state-model" an 
antagonist would just block the interconversion of different 
states rather than preferably bind to distinct states (Fig. (3)) 
[36]. 
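
The state counts in the models above follow directly from the three binary processes (ligand binding, receptor isomerization, G protein binding). The sketch below enumerates them; it is an illustration under the assumption that each process is a simple on/off switch, and all function names are hypothetical.

```python
from itertools import product


def state_label(ligand, active, g_protein):
    """Build a label such as 'G·Rec*·Lig' from three on/off flags."""
    receptor = "Rec*" if active else "Rec"
    parts = (["G"] if g_protein else []) + [receptor] + (["Lig"] if ligand else [])
    return "·".join(parts)


# Eight-state model: every combination of the three processes is allowed.
EIGHT_STATES = list(product((False, True), repeat=3))  # (ligand, active, g_protein)


def six_state_subset(states):
    """Six-state model: the G protein binds only to the activated receptor Rec*."""
    return [s for s in states if s[1] or not s[2]]


def transitions(states):
    """Pairs of states that differ in exactly one of the three processes."""
    return [(a, b) for i, a in enumerate(states) for b in states[i + 1:]
            if sum(x != y for x, y in zip(a, b)) == 1]


print(sorted(state_label(*s) for s in EIGHT_STATES))
print(len(six_state_subset(EIGHT_STATES)), len(transitions(EIGHT_STATES)))
```

Viewed this way the eight states are the corners of a cube and the single-process transitions are its edges. The two states missing from the six-state model are exactly G·Rec and G·Rec·Lig, the resting-state complexes the text says are needed.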

In order to address phenomena such as isosteric or 
allosteric antagonism, structural models with atomic 
resolution are mandatory that are actually frequently used for 
both rationalizing structure-activity relationships of low 
molecular weight agonists and antagonists, as well as 
understanding the results from site-directed mutagenesis 
experiments. A detailed discussion of the actual status of 
experimentally derived, and molecular modeling derived 
GPCR structures [37] is beyond the scope of this review, 
since this contribution is mainly aimed to introduce the 
currently applied technologies to identify compounds 
selectively modulating peptide-binding GPCRs. 



GPCR Classification 

Exhaustive sequence analysis revealed three major 
homology families for the mammalian GPCRs, notably the 
family 1 or rho-family (prototype: rhodopsin), the family 2 
or scr-family (prototype: secretin receptor), and the family 3 
or mGluR family (prototype: metabotropic glutamate 
receptors) receptors (Fig. (4)) [32-35]. Family 1 receptors are 
divided into further subfamilies according to the size and 
chemical nature of their corresponding agonists, as well as 
the mode of ligand binding. Family 1a accommodates the 
β-adrenoceptor-type receptors that are activated by small 
ligands such as biogenic monoamines, opiates, nucleotides, 
and small peptides, which comparably bind to a 
transmembrane cavity formed by helices 3, 4, 5, and 6. 
Family 1b is composed of receptors stimulated by 
oligopeptides and proteins such as IL-8 (interleukin-8), 
cytokines, and thrombin. The ligand binding epitope is 
located in the extracellular loop region. Family 1c receptors 
recognize glycoprotein hormones such as LH (luteinizing 
hormone), TSH (thyroid-stimulating hormone), and FSH 
(follicle-stimulating hormone), while their ligand binding site 
is centred in a large extracellular N-terminal domain (Fig. 
(4)). 

Fig. (4). Sequence homology-derived classification of GPCRs. Each GPCR family is characterized by a common ligand binding mode. 

Family 2 receptors are distinct from rho-family receptors 
in that they bind large peptides like glucagon, secretin, PTH 
(parathyroid hormone), VIP (vasoactive intestinal peptide), or CRF 
(corticotropin-releasing factor). Comparable to family 1c 
receptors, the secretin family utilizes a large N-terminal 
domain for ligand binding. Family 3 receptors are unique 
since they possess a large extracellular N-terminal domain of 
several hundred residues that constitutes the binding site for 
smallish ligands such as a single divalent Ca2+ cation, 
glutamate, GABA (γ-aminobutyric acid), and pheromones 
(Fig. (4)). 
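
The classification above can be summarized as a small lookup structure. The sketch below only restates what the text says about prototypes, example ligands and binding sites; the class and function names are hypothetical and the data model is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPCRFamily:
    prototype: str
    example_ligands: tuple
    binding_site: str


# Entries transcribed from the classification described in the text.
GPCR_FAMILIES = {
    "1a": GPCRFamily("rhodopsin",
                     ("biogenic monoamines", "opiates", "nucleotides", "small peptides"),
                     "transmembrane cavity formed by helices 3, 4, 5 and 6"),
    "1b": GPCRFamily("rhodopsin",
                     ("IL-8", "cytokines", "thrombin"),
                     "extracellular loop region"),
    "1c": GPCRFamily("rhodopsin",
                     ("LH", "TSH", "FSH"),
                     "large extracellular N-terminal domain"),
    "2": GPCRFamily("secretin receptor",
                    ("glucagon", "secretin", "PTH", "VIP", "CRF"),
                    "large extracellular N-terminal domain"),
    "3": GPCRFamily("metabotropic glutamate receptors",
                    ("Ca2+", "glutamate", "GABA", "pheromones"),
                    "large extracellular N-terminal domain"),
}


def families_with_binding_site(keyword):
    """Family codes whose binding-site description contains the keyword."""
    return sorted(code for code, fam in GPCR_FAMILIES.items()
                  if keyword in fam.binding_site)


print(families_with_binding_site("N-terminal"))
```

Grouping by binding site rather than by ligand type makes the point of the following paragraph visible: families with quite different ligands can share the same binding mode.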



In the light of this classification, peptide-binding 
receptors are not structurally homogeneous since they belong 
to families 1 and 2. Consequently, correlation of sequence 
homology with ligand similarity remains questionable, 
which is also reflected by the mutually different binding modes 
of peptidic and non-peptidic agonists and antagonists. 



Ligand Variety 

GPCRs are stimulated by an amazingly large number of 
agonists covering a broad range of chemical diversity. 
Ligands are as small as divalent cations, biogenic 
monoamines such as acetylcholine or serotonin, fragrances 
and taste molecules such as aspartame or limonene, single 
amino acids such as glutamate or GABA, or nucleotide 
analogues such as adenosine. Medium-sized ligands range 
from cannabinoids over prostaglandins to small 
oligopeptides such as enkephalins, angiotensin II, 
bradykinin, somatostatin, and tachykinins. Larger 
oligopeptides and globular proteins constitute the family of 
macromolecular ligands including e.g. neuropeptide Y, C5a 
anaphylatoxin, interleukin-8, or chemokines. Even 
proteolytic enzymes such as thrombin, which activates its 
receptor by cleaving off an N-terminal peptide, selectively 
bind to distinct members of the GPCR superfamily. Apart 
from their important role in sensory perception including 
smell and taste, GPCRs are obviously optimized to recognize 
several compound classes, i.e. nucleosides, lipid mediators, 
neurotransmitters, peptides, and proteins [6,18,38]. 

In this context, it is interesting to note that the majority 
of GPCR-targeted therapeutic principles exploit only a single 
compound class, notably the neurotransmitters. When the 
number of currently identified neurotransmitter receptors is 
compared with the number of disease-relevant 
peptide-binding GPCRs, an obvious imbalance becomes apparent in 
that only a small number of peptide-binding GPCRs is 
targeted by established therapies. Agonism and antagonism 
of e.g. α and β adrenoceptors, dopamine, histamine, 
serotonin, or muscarinic acetylcholine receptors are well 
established therapeutic principles for numerous best-selling 
drugs covering virtually all therapeutic areas, including 
gastrointestinal, cardiovascular, and CNS indications. In 
contrast, only two peptide-binding GPCR families are 
addressed by marketed non-peptide drugs, namely the opioid 
receptors and the angiotensin II receptor. However, the 
importance of peptide- and protein-binding GPCRs for drug 
discovery continues to be manifested by the fact that across 
current pharmaceutical research, especially in industry, 
numerous projects are pursued to identify leads that, upon 
optimization, fulfil all pharmacodynamic and 
pharmacokinetic demands required for clinical applicability 
(Table 1). 



CLASSICAL LEAD FINDING AND DRUG 
DEVELOPMENT 

Currently applied drug design and discovery approaches 
are typically classified as rational or random, depending on 
whether or not structural rationales are employed. The area of 
GPCR agonists and antagonists research is mainly driven by 
screening approaches in which large numbers of randomly 
selected chemical entities are tested in high-throughput 
screens. These shotgun procedures provide a practical means 
for identifying new leads for a particular receptor. In the 
following, this classical approach for GPCR-targeted drug 
discovery will be exemplified with prototype studies 
conducted on the angiotensin II, endothelin, bradykinin, 
neurokinin, and NPY receptors, respectively. 



Table 1. Selection of Endogenous Peptides that Exert their Biological Activity by Selective Activation of a GPCR 

GPCR | code | native ligand (peptide/protein) | nature of the ligand 
angiotensin receptors | AT1, AT2 | angiotensin II (AII) | Asp-Arg-Val-Tyr-Ile-His-Pro-Phe 
bombesin receptors | BB1 - BB4 | bombesin, neuromedin B, gastrin-releasing peptide | 14 aa peptide amide 
bradykinin receptors | B1, B2 | bradykinin (BK) | Arg-Pro-Pro-Gly-Phe-Ser-Pro-Phe-Arg 
C3a receptor | C3aR | C3a anaphylatoxin | protein 
C5a receptor | C5aR | C5a anaphylatoxin | protein 
CC chemokine receptors | CCR1 - CCR9 | chemokines | proteins 
CXC chemokine receptors | CXCR1 - CXCR5 | chemokines | proteins 
cholecystokinin/gastrin receptors | CCKA, CCKB | cholecystokinin (CCK), gastrin | 33 aa peptide amide, 17 aa peptide amide 
endothelin receptors | ETA, ETB | endothelin-1 (ET-1), ET-2, ET-3 | 21 aa peptides 
alpha factor pheromone receptor | STE2, STE3 | fungal mating pheromones | 13 aa peptide 
fMet-Leu-Phe receptor | fMLP-R | formylpeptide (fMLP) | fMet-Leu-Phe 
galanin | GAL1, gal2, gal3 | galanin | 30 aa peptide 
melanocortin receptors & ACTH receptor | MC1, MC3, MC4, MC5; ACTH receptor | melanocortin (MSH); adrenocorticotropic hormone (ACTH), corticotropin | 39 aa peptide (ACTH) 
neuropeptide Y receptor | | | 
neurotensin receptor | NTS1, nts2 | neurotensin | 13 aa peptide 
opioid receptors | δ | [Met]-enkephalin, [Leu]-enkephalin | Tyr-Gly-Gly-Phe-Met/Leu 
 | | dynorphin A | 17 aa peptide 
 | | β-endorphin, lipotropin C fragment | 31 aa peptide 
nociceptin receptor | ORL1 | nociceptin, orphanin FQ | 17 aa peptide 
somatostatin receptors | sst1 - sst5 | somatostatin | cyclic 14 aa peptide 
tachykinin receptors | NK1 | substance P | 11 aa peptide 
 | NK2 | neurokinin A (NKA), substance K, neuromedin L | His-Lys-Thr-Asp-Ser-Phe-Val-Gly- 
 | | neurokinin B (NKB), neuromedin K | Asp-Met-His-Asp-Phe-Phe-Val-Gly-Leu-Met-NH2 
thrombin / protease-activated receptors | PAR1, PAR2, PAR3, PAR4 | thrombin, trypsin, factor Xa | 
vasopressin receptors | V1A, V1B, V2 | vasopressin | Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2 
oxytocin receptor | OT | oxytocin | Cys-Tyr-Ile-Gln-Asn-Cys-Pro-Leu-Gly-NH2 
vasotocin receptor | VT | vasotocin | Cys-Tyr-Ile-Gln-Asn-Cys-Pro-Arg-Gly-NH2 
orexin receptors | OX1, OX2 | orexin A/B | 33 aa/28 aa peptide amides 
FSH receptor | FSH receptor | follicle-stimulating hormone (FSH) | protein 
LSH receptor | LSH receptor | lutropin, choriogonadotropic hormone, luteinizing hormone | 
TSH receptor | TSH receptor | thyrotropin, thyroid-stimulating hormone | 
LHRH receptor | LHRH receptor | gonadotropin-releasing hormone (GnRH), luteinizing hormone-releasing hormone (LHRH) | pGlu-His-Trp-Ser-Tyr-Gly-Leu-Arg-Pro-Gly-NH2 
thyrotropin-releasing hormone & secretagogue receptors | TRH1, trh2 | thyrotropin-releasing hormone/factor (TRH/F) | pGlu-His-Pro-NH2 
GHS receptor | GHSR1a, GHSR1b | growth hormone secretagogues (GHS) | oligopeptides 
calcitonin/calcitonin gene-related peptide receptors | CGRPR | calcitonin, calcitonin gene-related peptide (CGRP) | 32 aa peptide amide 
amylin receptor | amylin receptor | amylin | 37 aa peptide amide 
adrenomedullin receptor | adrenomedullin receptor | adrenomedullin | 52 aa peptide amide 
corticotropin-releasing factor receptor | CRF1, CRF2 | corticotropin-releasing factor | 41 aa peptide amide 
gastric inhibitory peptide receptor | gip receptor | gastric inhibitory peptide (GIP) | 42 aa peptide 
glucagon/glucagon-like peptide receptor | GLP1 | glucagon | 29 aa peptide 
growth hormone-releasing hormone receptor | GHRH receptor | growth hormone-releasing hormone/factor (GHRH/GRF) | 44 aa peptide amide 
parathyroid hormone receptor | type 1, type 2 | parathyroid hormone (PTH) | 84 aa peptide 
secretin receptor | secretin receptor | secretin | 27 aa peptide amide 
vasoactive intestinal peptide & PACAP receptor | VPAC1, VPAC2, PAC1 | vasoactive intestinal peptide (VIP), pituitary adenylate cyclase activating peptide (PACAP) | 28 aa peptide amide, 38 aa peptide 



Angiotensin-II Antagonists 
Biomedical Significance 

The endogenous octapeptide hormone angiotensin-II 
(A-II) (Table 1), Asp-Arg-Val-Tyr-Ile-His-Pro-Phe, is the key 
effector compound of the renin-angiotensin system (RAS), 
which is one of the main blood pressure and electrolyte/fluid 
homeostasis regulating systems in mammals [39]. As a result 
of a proteolytic cascade starting with angiotensinogen, 
angiotensin-II is released from its precursor decapeptide 
angiotensin-I by the action of angiotensin-I converting 
enzyme (ACE), the latter being a qualified target of 
antihypertensive drugs [40]. The conversion from 
angiotensinogen to angiotensin-I is catalyzed by the aspartic 
protease renin, peptide-type inhibitors of which have not yet 
reached an advanced state of clinical development [41]. A-II 
interacts specifically with two different receptor subtypes of 



the GPCR superfamily, notably the AT1 and the AT2 
receptor, respectively [21]. Interaction with the AT1 receptor 
causes severe vasoconstriction, aldosterone release, 
vasopressin secretion, and renal sodium reabsorption. These 
effects convergently result in a dramatic increase of 
extracellular fluid volume, thus giving rise to a significant 
hypertensive effect. Therapeutic intervention into the RAS 
clearly offers major clinical and commercial success as shown 
with the ACE inhibitors for the treatment of hypertension 
and congestive heart failure [40]. Because ACE 
inhibitors cause dry cough and angioedema [42], new 
strategies have been sought to block the vasoconstrictory 
activities of the biologically active player, A-II [43]. Specific 
inhibition of the A-II target receptor interaction, the final step 
of the RAS, offers an entirely new and selective approach to 
blocking this regulatory system regardless of the source of 
the biologically active peptide. And indeed, selective 
nonpeptide A-II antagonists emerged as a new class of 





Fig. (5). Structures of marketed A-II antagonists. 

antihypertensives on the cardiovascular drug market 
exemplified by the released drugs Losartan 1 [44,45], 
Valsartan 2 [46], Eprosartan 3 [47], Irbesartan 4 [48], 
Candesartan 5 [49], and Telmisartan 6 [50], respectively 
(Fig. (5)). 

Consequently, the angiotensin receptor represents one of 
the most advanced drug targets from the family of peptide- 
binding (non-opioid) GPCRs in the sense that screening hits 
have successfully been transferred to leads, further to 
development candidates that finally reached the drug market 
as safe and innovative drugs introducing a new therapeutic 
principle. 

Lead Finding 

In the search for A-II antagonists, potent peptides have 
been synthesized in a classical ligand-based design concept, 
yielding e.g. [Sar1,Ala8]-Angiotensin-II, commonly termed 
Saralasin [51]. However, all these peptides display limited 
therapeutic value as potential antihypertensives due to their 
poor oral bioavailability, rapid excretion, structural 
complexity, and significant agonistic profiles [51,52]. 




The feasibility of identifying nonpeptide AT receptor 
binding compounds with a purely antagonistic profile was 
demonstrated by a research group at Takeda Chemical 
Industries in 1982. In a series of two patents, Furukawa and 
co-workers reported on the inhibition of the 
angiotensin-II-induced contractile response in rabbit aorta by numerous 
different 1-benzylimidazole-5-acetic acid derivatives (Fig. (6)) 
[53]. The two compounds S-8307 7 and S-8308 8 mark the 
beginning of a new era of antihypertensive drug research in 
which almost every pharmaceutical company attempted to 
derive new compounds from these initial findings. 

Drug Development 

The Takeda compounds served as lead structures for the 
development of highly potent and selective analogues at 
DuPont that culminated in Losartan 1 (DuP-753, 
EXP-7711), the first nonpeptide A-II antagonist that was approved 
by the FDA and reached the market (Fig. (7)). Guided by 
molecular modeling studies, the substitution pattern of the 
benzylic phenyl ring was changed, yielding EXP-6155, 9, 
which displayed a ten-fold increased binding affinity over 
e.g. S-8307 7 [54]. Further extension in para-position of the 
aromatic ring resulted in more potent analogues as shown 
with EXP-6803 10 [55]. 

Fig. (6). Initial lead structures disclosed by Takeda Chemical Industries. 

Fig. (7). Development of Losartan 1. 

The deletion of the interaromatic carboxamide linkage, 
yielding biphenylmethyl-substituted imidazole-5-acetic acid 
derivatives, produced orally active compounds, and 
subsequent exchange of the ortho-carboxylic acid on the 
terminal aromatic ring for the tetrazole moiety further 
improved the oral activity [56,57]. The imidazole-5-acetic 
acid substituent was modified to the corresponding alcohol 
in the analogue chosen as clinical candidate. However, 
it could later be shown that the parent acetic acid sidechain of the 
imidazole core is the active metabolite of Losartan 1 [58]. 

Instead of modifying the N-1 substituent of the Takeda 
imidazole derivatives, 7 and 8, SmithKline Beecham 
decided to explore the 5 position in more detail (Fig. (8)). 
Introduction of an acrylic acid in that position (11) resulted 
in a 15-fold enhancement in binding affinity. Further 
introduction of a 2-thienylmethyl group in α-position of the 
acrylic acid substituent (12) together with a modification of 
the N-1 benzylic substituent finally yielded SK&F-108566, 
3 [59,60], which inhibits A-II binding to its receptor in the 
single digit nanomolar range [61]. 

Fig. (8). Development of Eprosartan 3. 

Fig. (9). Next-generation "sartans" in advanced states of clinical development. 

The Ciba compound CGP-48933, 2 (Fig. (5)) is the 
result of an optimization process attempting to replace the 
imidazole ring structure originally described by Takeda [53]. 
The 1-benzyl-2-butyl-4-chloro-imidazole-5-acetic acid is 
replaced with an N-terminally acylated amino acid, notably 
valine. CGP-48933, 2 has passed clinical development 
and reached the market as Valsartan [62]. It is clearly beyond 
the scope of this review to systematically summarize the lead 
optimization programs pursued by the different 
pharmaceutical companies; however, it should be 
emphasized that, apart from the currently marketed drugs, 
numerous next-generation compounds and follow-ups in late 
clinical development are expected to be approved in the near 
future (Fig. (9)) [63,64]. These new "sartans" (13 - 20) 
together with the first-generation drugs (1 - 6) will further 
change the landscape of antihypertensive prescription drugs 
since they clearly introduced a new quality of 
antihypertensive principles into the therapy of cardiovascular 
diseases. 

Apart from these biomedical aspects, the development of 
the "sartans" acting specifically on a member of the GPCR 
superfamily evolved to a textbook example of protein- 
targeted drug design within modern medicinal chemistry 
[65]. 



Endothelin 

Biomedical Significance 

Endothelin 1 (ET-1) is a 21 amino acid bicyclic peptide 
(Table 1) that was initially isolated from porcine aortic 
endothelial cells [66]. The endothelins constitute a class of 
three related isopeptides (ET-1, ET-2, ET-3) 
exhibiting vasoconstrictive and mitogenic potential [68] 
upon binding to two receptor subtypes, notably the ETA and 
ETB receptor [69,70]. ET-1 selectively binds to the ETA 
receptor, which is expressed on vascular smooth muscle cells 
(lung, aortic, heart) and mediates vasoconstriction and 
proliferation through activation of a complex intracellular 
signalling cascade [71,72]. The ETB receptor, localized in 
the brain, on vascular endothelial cells, and smooth muscle 
cells, is responsible for vasodilation via the release of nitric 
oxide, prostacyclin, and adrenomedullin [73,74]. In 
addition, ETB functions as a clearance receptor for 
endogenous ET by the internalization of the receptor-ligand 
complex. On the other hand, ETB may also cause 
vasoconstriction in some tissues [75]. ETA and ETB 
receptors share high sequence similarity (approx. 68%). ET-1 is 
predominantly produced by endothelial cells, acting in an 
autocrine and paracrine fashion as a mediator of vascular 
function. Elevated ET levels have been observed in tissue and 
plasma in a number of cardiovascular disorders, thereby 
contributing to disease states including hypertension [76], 
vasospasm, atherosclerosis [77], acute myocardial infarction 
[78], congestive heart failure [79,80], restenosis [81], 
subarachnoid hemorrhage, ischemia, pulmonary hypertension 
[82], and renal failure [83]. Due to the pivotal 
pathophysiological role of the endothelin receptor-ligand 
interaction, this receptor system emerged as a promising 
target for therapeutic intervention in the disease states 
mentioned above [84]. 

Fig. (10). Aryl-sulfonamide-type ET antagonists. 



Lead Finding 

Since the discovery of ET-1 in 1988, a large number of 
potent antagonists have been described [84]. The first 
antagonists emerging from random screening efforts were 
reported in 1992. These first-generation compounds 
comprise anthraquinones from Streptomyces misakiensis, 
steroids isolated from bayberry, Myrica cerifera, and 
diphenyl ethers discovered in fungal broths [85]. Lead 
finding in this field is mainly based on compound library 
screening followed by classical lead optimization within 
medicinal chemistry programs. A number of peptide-based 
antagonists have been reported, including the prominent 
cyclic pentapeptide BQ-123 and other peptide antagonists, 
e.g. BQ-788, FR-139317, PD145065, PD156252, 
RES-701-1, TAK-044, and IRL2500 [84-88]. 

As mentioned above, this review will focus on the 
development of nonpeptide antagonists emerging from those 
programs directed towards the discovery of active low 
molecular weight compounds. Primarily, the ETA-selective 
antagonists as well as antagonists exhibiting mixed ETA/ETB 
affinity play a major role for therapeutic intervention, even 
though some ETB-selective antagonists have been reported 
only recently. 

Aryl Sulfonamides 

Bristol Myers Squibb designed BMS-182874, 21, a 
nonpeptide ETA-selective antagonist, from an initial hit 
which was discovered by screening of a sulfathiazole library 
[89]. The sulfonamide BMS-182874, 21, exhibits an IC50 
value of 150 nM at the ETA receptor (A10 cells) and shows 
no binding affinity to the ETB receptor (Fig. (10)). 

From a similar series of compounds, 
Immunopharmaceuticals (Texas Biotech.) developed an 
isoxazolyl-thiophene sulfonamide, TBC-11251, 22 
(Sitaxsentan) [90]. This orally active compound has shown 
efficacy in a phase II clinical trial of congestive heart failure 
(CHF) and demonstrated activity in a rat model of 
myocardial infarction and acute hypoxia-induced pulmonary 
hypertension (PH) [91]. Further investigations established a 
unique pharmacophore framework, characterized by a central 
thiophene subunit, for selective ETA antagonism [92]. 
Maintaining the sulfonamide substituent in position 3 and 
altering the substituent in position 2 of the thiophene ring 
led to a series of compounds with enhanced pharmacokinetic 
properties. TBC-2576, 23, the optimal analogue in this 
series, showed about 10-fold higher ETA binding affinity 
than Sitaxsentan, 22, and high ETA-selectivity as 
well as a serum half-life of 7.3 h in rats, paired with in vivo 
activity (Fig. (10)) [92]. 

Fig. (11). Butenolide-type ET antagonists. 

A number of nonpeptide ETA/ETB antagonists based on 
a pyrimidyl-benzene sulfonamide scaffold have been reported. 
The first example of an orally active representative is 
Ro46-2005, 24 [93], which was obtained after optimization of a 
lead compound identified by random screening in an 
antidiabetic project. The binding affinities of Ro46-2005, 24 
([ETA value illegible in source]; Ki=1000 nM (ETB)) could further be 
optimized, yielding the pyrimidyl-benzene analogue 
Ro47-0203, 25 (Bosentan), which represents an improvement in 
both receptor binding affinities (Ki=4.7 nM (ETA), Ki=95 
nM (ETB)) and oral activity (Fig. (10)) [94]. Bosentan, 25, is 
a competitive mixed ETA/ETB antagonist showing 
promising results in clinical trials [88] in terms of 
vasodilation. Further, it improves [illegible] 
performance and reduces renal dysfunction. The beneficial 
effects of Bosentan 25 have been characterized in CHF 
models, in hypertension-related experiments, and in 
subarachnoid hemorrhage (SAH) trials. These further 
potential applications have been described in a recent review 
by Roux et al. [88]. 

Butenolides

CI-1020, also known as PD156707, 26, 27 [95] emerged
from the optimization of an initial lead structure which was
identified from library screening (Fig. (11)).
Optimization of the structure was guided by following the Topliss
"decision tree" approach based on QSAR principles [96]. CI-
1020 26, 27 represents the first clinical candidate emerging
from the Parke-Davis series of butenolides. With an IC 50
value of 0.30 nM on the recombinant human ET A receptor
(IC 50 =780 nM (ET B )) it demonstrates high ET A selectivity
(2600-fold). 26, 27 undergoes [illegible],
thereby establishing the γ-hydroxy butenolide [illegible]
under acidic conditions, while at basic pH the equilibrium is
shifted in favour of the ring-opened γ-keto acid salt structure
26 [95]. The poor water-solubility of this compound, caused
by cyclization, has driven the drug development process
towards a series of water-soluble ring-closed γ-hydroxy
butenolides applicable for parenteral use [97]. One of the
follow-up compounds exhibits a promising pharmacological
profile by displaying improved activity compared to CI-
1020 26, 27, e.g. in preventing acute hypoxia-induced
pulmonary hypertension (PH) in rats.
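
The fold-selectivity figures used throughout this section are simply the ratio of the binding constant at the off-target receptor subtype to the constant at the target subtype. The following is a minimal sketch of that arithmetic, assuming both constants were measured under comparable assay conditions; the function name and the dictionary are hypothetical helpers introduced only for illustration, and the numbers are the ET A and ET B values quoted above for the butenolide 26, 27 and for Bosentan 25.

```python
def fold_selectivity(target_nm: float, off_target_nm: float) -> float:
    """Return how many times more tightly a ligand binds the target subtype.

    Both arguments are affinity constants (Ki or IC50) in nM. A lower value
    means tighter binding, so the ratio is off-target / target.
    """
    if target_nm <= 0 or off_target_nm <= 0:
        raise ValueError("affinity constants must be positive")
    return off_target_nm / target_nm


# Values quoted in the text, in nM, as (ET A, ET B). Names are labels only.
examples = {
    "butenolide 26, 27": (0.30, 780.0),
    "Bosentan 25": (4.7, 95.0),
}

for name, (et_a, et_b) in examples.items():
    ratio = fold_selectivity(et_a, et_b)
    print(f"{name}: about {ratio:.0f}-fold ET A-selective")
```

On this reading the butenolide is a highly ET A -selective ligand, whereas Bosentan, with a ratio of roughly 20, behaves as a mixed antagonist. The ratio is only meaningful when both constants are of the same type (both K i or both IC 50 ).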

The most promising characteristics were found for an
analogue containing the sodium salt of a sulfonic acid,
compound 28 (Fig. (11)) [97]. It shows high ET A
selectivity (4200-fold) with an IC 50 value of 0.38 nM (ET A )
and ET A functional activity of K B =7.8, which is similar or
even superior to the progenitor CI-1020 26, 27. Moreover, it
displays improved water-solubility and shows higher
activity after i.v. infusion in preventing acute hypoxia-
induced PH in rats (ED 50 =0.3 µg/kg/h) when compared to
CI-1020 26, 27 [97]. The new compounds are currently
being evaluated in preclinical trials, while CI-1020 26, 27 has
already been tested in a model of acute stroke and has entered
clinical development for cerebral ischemia.

Indane Carboxylic Acids

SB209670 29 emerged from the SmithKline Beecham
laboratories after optimization of an initial hit discovered
by compound library screening (Fig. (12)) [98]. Within a
molecular modeling-driven approach based on a comparison
of the NMR-derived conformation of ET-1 with the primary
hit, an indene carboxylic acid derivative, the mixed
ET A /ET B receptor antagonist SB209670 29 was designed
(K i =0.43 nM (hET A ), K i =14.7 nM (hET B )). When
administered i.v., SB209670 29 shows efficacy in different
animal models of ET-mediated disease states, e.g. renal
failure, hypertension [84], and ischemia-induced stroke. Due
to its low oral bioavailability (4%), a structurally related
analogue, SB217242 30 [99], was investigated that displays
improved pharmacokinetics and bioavailability [86].
SB209670 29 is under development (phase I) for acute i.v.
indications with efficacy in pulmonary hypertension (PH),
chronic renal failure (CRF) and stroke [87], while SB217242
30 (phase I) is in development for chronic PH and chronic
obstructive pulmonary disease (COPD) [87,100].

Fig. (12). Indane carboxylic acid-type ET antagonists.

Pyrrolidine Carboxylic Acids

The SmithKline Beecham compound SB209670 29
(Fig. (12)) served as template for the design of the
pyrrolidine carboxylic acid A-127722 rac-31 (Fig. (13))
[101], which has been disclosed as a potent, ET A -selective
antagonist currently tested in clinical trials (PH, CHF) [87].
A-127722 rac-31 was reported to dose-dependently prevent
cerebral oedema in stroke-prone spontaneously hypertensive
rats [100]. ABT-627 31, the active enantiomer (2R,3R,4S) of
the trans-trans configured 2,3,4-trisubstituted pyrrolidine
ring, shows an IC 50 value of 0.08 nM on ET A and 8.1 nM
on ET B [102]. The 1800-fold selectivity was dramatically
altered by subtle structural modifications of A-127722 rac-
31, which led to A-182026 32 with an ET A /ET B selectivity
ratio of 3, thus being the most potent balanced dual
ET A /ET B antagonist known today. Replacement of the
dialkyl-acetamide (rac-31) by a 2,6-dialkyl-acetanilide
resulted in an ET B -selective antagonist, A-192621 33,
exhibiting promising pharmacological properties [103].
Combination of the structure-activity relationships (SAR)
derived from the first series of ET A -selective compounds
(e.g. ABT-627 31) and the second series of ET B -selective
antagonists (e.g. A-192621 33) led to a further optimized
series of compounds. Therein A-308165 34 has been
identified as a highly selective (27000-fold), orally active ET B
antagonist [104].




Fig. (13). Pyrrolidine carboxylic acid-type ET antagonists.

Administration of ET B -selective antagonists led to
hypertensive responses, indicating that they are not suitable
as agents for a long-term systemic single ET B -directed
therapy [103]. Nevertheless, ET B -selective antagonists are
expected to be a valuable tool for the elucidation of the role
of ET B receptor action under normal and
pathophysiological conditions [104]. Most recently an
ET A -selective antagonist, derived by optimization of A-
127722 rac-31, emerged from the series of pyrrolidine-based
compounds [105]. A-216546 35 is a further orally active ET
receptor antagonist showing >25000-fold selectivity for the
ET A receptor (K i =0.46 nM), and is considered for clinical
development as a therapeutic agent for chronic treatment of
ET-1-mediated diseases [106]. Compound 36 (IC 50 =5.6 nM
(ET A ), >10000-fold selectivity) is currently under
investigation at Abbott Laboratories as an ET A antagonist.
Apart from the ET receptor affinity, A-216546 35 showed
remarkable inhibition potential for numerous members of the
GPCR superfamily such as adenosine receptors, δ-opioid
receptor, purinergic receptor, etc. [106], thus indicating a
kind of "ligand crosstalk" which turns out to be a common
phenomenon of GPCR-targeted compounds.

Phenylacetamides

L-749,329 37 (Fig. (14)) is an orally active, competitive
and nonselective ET A /ET B antagonist developed by Merck,
inhibiting the binding of [125I]ET-1 in Chinese Hamster
Ovary (CHO) cells expressing human ET receptors with
IC 50 values of 0.8 nM (ET A ) and 16 nM (ET B ), respectively
[107]. The active enantiomer, L-754,142 37, is a potent
orally active ET antagonist with a long duration of action in
several in vivo models. L-754,142 37 shows binding affinity
towards ET A (0.062 nM) and ET B (2.25 nM) and
antagonizes ET-1-induced phosphatidyl inositol hydrolysis
in CHO cells expressing cloned human ET receptors with
IC 50 values of 0.35 nM (ET A ) and 26 nM (ET B ) [108].
Substitution of the ether oxygen by a methylene group
resulted in L-751,281 38, an analogue with similar activities
on both ET receptor subtypes [107].

α-Phenoxyphenylacetic Acids

At the Merck laboratories, structural modifications of an
initial lead discovered by screening for angiotensin II (AII)
antagonists led to a dual AT 1 /ET antagonist. Further
optimization towards ET A -selectivity resulted in L-744,453
39 (Fig. (15)), an α-phenoxyphenylacetic acid derivative
lacking the sulfonamide present in the arylacylsulfonamides
L-749,329 37 and L-751,281 38 [107]. L-744,453 39
competitively and reversibly inhibits [125I]ET-1 binding to
CHO cells expressing cloned human ET receptors with K i
values of 4.3 nM (ET A ) and 232 nM (ET B ). Thus, with
L-744,453 39 the shift from an original angiotensin II
antagonist to an ET-selective antagonist could be
demonstrated, highlighting the potential of "cross-
fertilization" of projects devoted to representatives of a
common receptor superfamily.

Fig. (14). Phenylacetamide-type ET antagonists.

Fig. (15). Aryloxyacetic acid-type ET antagonists.

α-Aryloxyacetic Acids

At the BASF laboratories, too, the endothelin project
started with screening of the in-house chemical substance
stock. The initial lead, which was originally intended as a
herbicide, was optimized by systematic structural
modifications resulting in an ET A -selective antagonist,
LU135252 40 (Fig. (15)), the active (S)-configured
enantiomer of LU127043 [109,110]. It selectively binds to
the ET A receptor with high affinity (K i =2 nM (ET A ),
K i =184 nM (ET B )) [111]. LU135252 40 has been evaluated
in clinical trials for preventing restenosis [87] and entered
phase II for CHF [112]. Furthermore, it was demonstrated
that selective ET A receptor inhibition with LU135252 40
could reduce ischemia-induced ventricular arrhythmias in
pigs. Thus ET antagonism might reduce mortality by
preventing arrhythmias, a major cause of death in CHF,
obviously induced by the pro-arrhythmogenic effects of ET-1
[100].

Phenoxybutanoic Acids and Stilbene Acids

According to a previously elaborated SAR study, Astles
et al. at Rhone-Poulenc Rorer presented the optimized
analogue RPR-111844 41 (Fig. (16)), which exhibits an
IC 50 of 5.0 nM at the rat ET A receptor and 1000-fold
selectivity over the ET B receptor. The promising
pharmacokinetics in a rat model of ET-1-induced
vasoconstriction rendered RPR-111844 41 an ideal
candidate to examine these effects in preclinical models of
cardiovascular disease [113].

In order to shed light on the characteristics of the
bioactive conformation, a new series of rigidified analogues,
the stilbene acids, was designed based on the SAR derived
from the series of phenoxybutanoic acids. Thus, compound
RPR-111723 42 was identified as the most potent analogue
with an IC 50 of 80 nM. Although the stilbene series was not
further developed, results from SAR will be back-transferred
into the more interesting series of phenoxybutanoic acids
[114].

Fig. (16). Phenoxybutanoic acid- and stilbene acid-type ET
antagonists.

Bradykinin

Biomedical Significance

The nonapeptide bradykinin (BK, Table I), Arg-Pro-Pro-
Gly-Phe-Ser-Pro-Phe-Arg, belongs to the family of kinins.
Kinins are small peptides which are released from
kininogens by several enzymes, the kallikreins [115-120].
Interaction of BK with two designated receptor subtypes, B 1
and B 2 , results in a variety of biological effects including
vasodilation, modulation of vascular permeability, smooth
muscle contraction, recruitment and priming of inflammatory
cells, induction of pain, modulation of transmitter release,
stimulation of cell division, etc. [121]. Based on these
diverse biological activities, BK is involved in inflammatory
diseases, such as asthma, rhinitis, pancreatitis, sepsis,
rheumatoid arthritis, brain oedema, and angioneurotic
oedema [122]. Due to these pathophysiological actions of
BK, mainly induced by the interaction with the B 2 receptor,
this system emerged as an interesting target in
pharmaceutical research. Hence, in a number of efforts B 2
antagonists were presented, intended to be valuable tools in
the treatment of the above-mentioned chronic diseases.

Lead Finding

'Second-Generation' B 2 Antagonists

Initiated by the discovery of NPC-567 by Vavrek and
Stewart [123] in the 90's, a number of selective peptidic B 2
receptor antagonists including Icatibant (Hoe-140) [124,125]
and Bradycor (Deltibant, CP-0127) [126], so-called 'second-
generation' antagonists, have been clinically evaluated. In
the following years, research programs were directed towards
the discovery of B 2 -selective nonpeptide antagonists.
Detailed overviews on this subject were provided only
recently by Altamura et al. [127] and Heitsch [128],
addressing projects of diverse research groups and reviewing
the current patent situation.

In 1993, the naphthylalanine derivative WIN-64338 43
(Fig. (17)) was disclosed as the first nonpeptide B 2
antagonist [129,130]. A random screening approach at
Sterling Winthrop led, after optimization, to compound WIN-
64338 43, displaying a K i value of 64 nM for the inhibition
of [3H]BK binding to the B 2 receptor (IMR-90 cells, a fetal
lung fibroblast cell line expressing the kinin B 2 receptor).
However, this compound is problematic in terms of potency,
oral bioavailability, and selectivity [130], since significant
affinity for e.g. the muscarinic receptors was detected [131].

Fig. (17). 'Second-generation' B 2 antagonist WIN-64338.

'Third-Generation' B 2 Antagonists

From 1994 on, Fujisawa published a series of patent
applications on new classes of potent, selective and orally
active nonpeptide B 2 receptor antagonists [132-135], thereby
establishing the so-called 'third-generation' compounds.
Several derivatives showed nanomolar affinity in receptor
binding assays and high efficacy in various species including
humans. They also exhibited in vivo functional antagonistic
activity against BK-induced bronchoconstriction in guinea
pigs and potency in diverse animal models of inflammation
[132-137]. Again, these compounds originally
emerged from a random screening directed towards the
angiotensin II (AII) AT 1 receptor and belong to a class of
imidazo[1,2-a]pyridines. A detailed description of the
design, synthesis and biological evaluation was given by
Kayakiri et al. only recently [138]. The first lead compound
44 (Fig. (18)) of this series of N-containing heteroaromatic
benzyl ethers showed an IC 50 value of 7.6 µM.

Within a classical medicinal chemistry approach based
on SAR considerations, the first compound 44 was
subjected to extensive modifications leading to 45 (Fig. (18)).
This analogue displays an IC 50 value of 2.4 nM for the
inhibition of the specific binding of [3H]BK to B 2 receptors
in guinea pig ileum (GPI) membrane preparations. Thus, the
8-[3-(N-acylglycyl-N-methylamino)-2,6-dichlorobenzyloxy]-
3-halo-2-methylimidazo[1,2-a]pyridine skeleton was
identified as the basic framework of the first orally active
nonpeptide B 2 antagonist. In order to overcome species
differences, further modifications within the 3-position of the
benzyl moiety revealed an analogue (FR167344 46)
exhibiting subnanomolar (IC 50 =0.66 nM) and low
nanomolar binding affinities (IC 50 =1.4 nM) for GPI
membranes and human A431 cells (epidermoid carcinoma
cells) [136,139], respectively.

Recent results indicate that FR167344 46 has specific
antagonistic activity against guinea pig tracheal smooth
muscle BK receptors, thus rendering it a potential
therapeutic tool for the treatment of asthma [140].
Derivatives containing the N,N-dimethylcarbamoyl-
substituted cinnamide group were capable of overcoming
species differences, and therefore defined the required
pharmacophore for further investigations. FR167344 46 was
assigned as the new lead compound for three independent
optimization approaches involving replacement of the
imidazo[1,2-a]pyridine moiety (benzimidazoles,
quinoxalines, and quinolines). While further optimization of
the quinoxaline series failed, optimization within the
benzimidazole and quinoline series resulted in several potent
congeners. Thus, consequent SAR studies of the
benzimidazoles afforded improvements of in vivo oral
activities, resulting in FR185627 47, which exhibits 75.2 %
inhibition against BK-induced bronchoconstriction at 0.32
mg/kg, i.p. [138]. Optimization of the quinoline series
afforded compound FR173657 48 with high potency in B 2
binding affinities for both GPI (IC 50 =0.46 nM) and human
recombinant B 2 receptors (IC 50 =1.4 nM) [136,141].
FR173657 48 displays the best in vivo B 2 antagonistic oral
activity among nonpeptide antagonists investigated so far
and was chosen as a clinical candidate for the treatment of
various inflammatory diseases. Recent investigations on
plasma extravasation mediated by activation of sensory
nerves in guinea pig airways suggest FR173657 48 to be an
orally active, promising anti-inflammatory agent for kinin-
dependent inflammation following antigen challenge [142].
Fujisawa researchers further report on the postulation of the
active conformation of their compounds by synthesizing
conformationally restrained analogues. Molecular modelling
studies and subsequent chemical synthesis of a novel pyrrole
series afforded FR193144 49, an analogue which mimics the
previously postulated cis-conformation of the N-methylamide
by the pyrrole moiety. FR193144 49 exhibits excellent
binding affinity for human recombinant B 2 receptors
(IC 50 =0.26 nM), thereby proving the cis-conformation as the
bioactive conformation of the N-methylamide bearing
antagonists (Fig. (18)) [138].

Fig. (18). 'Third-generation' B 2 antagonists developed by Fujisawa.

Interestingly, only minor variations within the core
structure of the B 2 antagonists resulted in an analogue,
FR190997 50 (Fig. (18)), exhibiting an agonistic profile
[143]. The agonistic behaviour is hypothesized to be
encoded in the difference concerning the 4-substituent of the
quinoline moiety within the agonist compared to the
antagonists (H → 2-pyridylmethoxy). FR190997 50 induces
a hypotensive response in anaesthetized rats and thus is
claimed for the treatment of hypertension, renal failure, heart
failure, circulatory disorders, angina, restenosis, hepatitis etc.
[143].

B 2 Antagonists Structurally Related to FR173657

Compounds evaluated at Fournier are structurally related
to Fujisawa's quinoline series, differing mainly in the
substituent in 3-position of the benzene linkage, which is
replaced by a sulfonamide. LF16-0335 51 (Fig. (19)) is a
potent, selective and competitive antagonist of the human B 2
receptor, displacing [3H]BK binding to membrane
preparations of CHO cells expressing cloned human B 2
receptors with a K i value of 0.84 nM.

LF16-0335 51 shows no affinity for the B 1 receptor,
nor does it bind significantly to any other membrane receptor
except the muscarinic M 2 (IC 50 =0.9 µM) and M 1 (IC 50 =1.0
µM) receptors [144]. The hydrochloride of this derivative,
LF16-0335C, competitively inhibits BK-induced
contractions of isolated rat uterus and GPI in functional
assays [145]. Given i.v., LF16-0335C inhibits BK-induced
hypotension in both animal species in a dose-dependent
manner [145]. Substitution of the piperazine ring in LF16-
0335 51 by a diaminopropane unit led to LF16-0687 52
(Fig. (19)), which was shown in competition binding studies
with [3H]BK to bind to the human recombinant B 2 receptor
expressed on CHO cells with a K i value of 0.67 nM (LF16-
0335 51: K i =0.84 nM). It functions as a competitive
antagonist of BK-mediated contractions in isolated organs,
i.e. rat uterus and GPI. Contrary to LF16-0335 51, LF16-
0687 52 showed selectivity for the B 2 receptor in binding
and functional studies performed on more than 40 different
receptors.

In a new series of patent applications, Hoechst claimed a
number of derivatives based on the lead structures delineated
by Fujisawa as potent B 2 receptor antagonists. These
heteroarylbenzyl ethers belong to a series of O-substituted 8-
quinolines or 4-benzothiazoles [146]. Heitsch et al. report
that the potency of the quinoline series was found to be
higher compared to the corresponding benzothiazoles. The
most potent antagonist 53 (Fig. (20)) shows an IC 50 value of
0.7 nM for the inhibition of specific binding of [3H]BK to
GPI membrane preparations and an EC 50 value of 4.1 nM for
the inhibition of BK-induced contraction of isolated GPI.

The most potent corresponding antagonist of the
benzothiazole series 54 (Fig. (20)) exhibits an IC 50 value of
10 3 nM and an EC 50 value of 54 nM. Another
representative example of the B 2 antagonists claimed by
Hoechst is compound 55 (Fig. (20)), which incorporates a 2-
aminoethanol unit instead of the N-methylamide as linker in
the central part of the molecule. 55 inhibits [3H]BK binding
(GPI) with a K i value of 20 nM [127,128].

Based on the template FR173657 48, Kyowa Hakko filed
a patent application claiming heteroarylbenzyl ethers as B 2
antagonists [147]. As in FR173657 48, the central ether
entity is flanked by a terminal quinoline and a
dichlorobenzene linker. Instead of the classical N-
methylamide sidechain in 3-position, the dichlorobenzene
linker bears a branched hydrocarbon chain (56, Fig. (20)).









Fig. (19). B 2 receptor antagonists disclosed by Fournier.

Fig. (20). Miscellaneous heteroarylbenzyl ether-type B2 antagonists.

Miscellaneous Nonpeptide B2 Antagonists 

From screening of a 4000-compound combinatorial
library, GlaxoWellcome found a promising
tetrahydroisoquinoline, GR213548X 57 (Fig. (21)), with
affinity for the receptor in the micromolar range [127].

Further B2 antagonists are claimed in a series of patent
applications by a number of companies. American Home
Products (AHP, Wyeth Ayerst) presented compound 58,
which structurally resembles the Fujisawa derivatives only
with respect to a quinoline entity. Pfizer described 1,4-
dihydropyridines such as 59 to act as B2 antagonists, while
Eli Lilly disclosed benzothiophenes 60 (Fig. (21)) [127,148-
150].



Fig. (21). Miscellaneous B2 antagonists.

Neurokinin

Biomedical Significance

Neurokinins (NKs), also termed tachykinins, belong to a
family of peptides sharing a common homologous C-
terminal fragment composed of the pentapeptide amide Phe-
Xaa-Gly-Leu-Met-NH 2 (Table I) [151]. The interaction of
substance P (SP), neurokinin A (NKA), and neurokinin B
(NKB) with their corresponding receptors [152], notably
NK 1 , NK 2 , and NK 3 , plays a pivotal role in the induction and
progression of inflammatory diseases. Neurokinin interaction
is involved in a variety of physiological and
pathophysiological conditions such as pain, inflammation, smooth muscle
contraction, vasodilation, and activation of the immune
system. Thus, NK receptor antagonists emerged as
interesting agents for the treatment of primarily pain, emesis
and asthma, but also to interfere in other disorders such as
anxiety, arthritis, migraine, cancer and schizophrenia [153-
156]. NK receptor antagonists have been reviewed e.g. by
Elliot and Seward [157], von Sprecher et al. [158], and,
only recently, in Current Medicinal Chemistry by Gao and
Peet [159]. Therefore, this contribution will solely focus on
nonpeptide NK antagonists.

Lead Finding

NK 1 Antagonists

The quinuclidine-based analogue CP-96,345 61 (Fig.
(22)) was developed from a lead structure found by random
screening and is the first nonpeptide NK 1 -selective
antagonist, showing an IC 50 value of 0.77 nM (lymphoblast
IM-9 cells) [160]. Over the last years, CP-96,345 61 evolved
as the main pharmacological tool in the area of NK receptor
research.

A second series of piperidine-containing analogues
developed at Pfizer includes CP-99,994 62 [161] and CP-
122,721 63 (Fig. (22)) [162]. CP-99,994 62 exhibits
analgesic efficacy [163] and shows less in vivo inhibition of
NK 1 receptor-mediated responses compared to the 5-
trifluoromethoxy analogue, CP-122,721 63 [164]. The latter
congener shows improved antiemetic properties in acute
cisplatin-induced vomiting in tumor patients when
administered in combination with a 5-HT 3 antagonist [157].

Fig. (22). Quinuclidine-, piperidine-, and morpholine-derived NK 1 antagonists.

Fig. (23). Spiro-aryl piperidine-type NK 1 antagonists.

Based on the piperidine core structure of CP-99,994 62,
Merck synthesized L-733,060 64 (IC 50 =0.87 nM in CHO
cells) [165] which, after modifications, led to the
metabolically more stable L-754,030 65 (IC 50 =0.1 nM in
CHO cells) (Fig. (22)) [166]. Recent results indicate that L-
754,030 65 prevents cisplatin-induced emesis in patients
receiving anticancer chemotherapy [167,168].

Glaxo disclosed the 5-tetrazolyl-substituted analogue
GR-203,040 66 (Fig. (22)), retaining the piperidine core
structure of CP-99,994 62, as NK 1 antagonist (GR-203,040
66: pK i =10.3 in CHO cells), which was selected for
clinical evaluation in emesis and migraine [169,170].
Further modification revealed GR-205,171 67 (Fig. (22))
(pK i =10.6 in CHO cells) which, apart from oral
bioavailability, also exhibits reduced L-type calcium channel
activity, a side effect associated with e.g. CP-122,721 63.
GR-203,040 66 ameliorates tissue damage induced by x-
irradiation or cisplatin [171,172].
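
The affinities of GR-203,040 66 and GR-205,171 67 are quoted as pK i values, whereas most other compounds in this section are characterized by K i or IC 50 values in nM. Since pK i is the negative decadic logarithm of K i expressed in mol/L, the two scales can be compared directly. The following is a minimal sketch of that conversion; the function name is a hypothetical helper introduced only for illustration.

```python
def pki_to_ki_nm(pki: float) -> float:
    """Convert pKi (= -log10 of Ki in mol/L) to Ki in nM.

    Ki [mol/L] = 10 ** (-pKi); multiplying by 1e9 gives nM,
    which is the same as 10 ** (9 - pKi).
    """
    return 10.0 ** (9.0 - pki)


# pKi values quoted in the text for the two Glaxo compounds.
for name, pki in (("GR-203,040 66", 10.3), ("GR-205,171 67", 10.6)):
    print(f"{name}: pKi {pki} corresponds to Ki of about {pki_to_ki_nm(pki):.3f} nM")
```

Under this definition both compounds fall in the range of roughly 0.03 to 0.05 nM, i.e. well below the subnanomolar IC 50 values quoted for L-733,060 64 and CP-96,345 61, although K i and IC 50 values from different assays should be compared with caution.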

Novartis developed CGP-49,823 68 (Fig. (22)), based on
the piperidine scaffold, for anxiety-related indications [173].
CGP-49,823 68 (IC 50 =12 nM, bovine retina) has been tested
for its antagonistic potential against the depolarization of
spinal motoneurones by bath application of the selective
tachykinin receptor agonist septide(6-11), exhibiting an IC 50
value of 0.3 µM (gerbil preparations) and 7.8 nM (rat
preparations) [174].

The central piperidine unit is also found in the Sanofi
compound SR-140,333 69 (Fig. (22)) (IC 50 =0.01 nM in
IM-9 cells), also termed Nolpitantium, which emerged from
a random screening approach followed by a lead optimization
program [175].




Investigations on the effects of SR-140,333 69 on
nociceptive pathways in rats revealed this agent to be a
potent drug for pain relief [176]. Kubota et al. reported on
the synthesis of spiro-piperidines as NK 1 receptor
antagonists [177]. SAR studies starting from the primary
lead YM-35375 70 (dual NK 1 /NK 2 antagonist) (Fig. (23))
yielded analogue YM-35384 71 as a selective NK 1
antagonist which was 12-fold more potent compared to the
original spiro[isobenzofuran-1(3H),4'-piperidine] YM-35375
70. YM-35384 71 already showed an IC 50 value of 58 nM,
which could be improved by further modification resulting in
compound YM-49244 72 (Fig. (23)), a spiro-substituted
piperidinium salt with an IC 50 value of 1.9 nM against SP-
induced contraction in guinea pig ileum and inhibitory
activity against selective NK 1 receptor agonist-induced
bronchoconstriction in guinea pigs (ID 50 =24 µg/kg, i.v.)
[177].

A further class of spiro-aryl piperidines is represented by
Merck Sharp and Dohme's spirocyclic aryl sulfonamides,
serine-derived NK 1 antagonists [178]. Compound 73 (Fig.
(23)) exhibits an IC 50 value of 1.0 nM for the displacement
of [125I]SP from NK 1 receptors in CHO cells and served for
the development of a pharmacophore model for the receptor
binding requirements [179].

Eli Lilly has identified the tryptophan-derived LY-
303,870 74 (Fig. (24)) as a selective antagonist binding to
NK 1 with high affinity, while lacking ion channel activity
[180]. LY-303,870, Lanepitant 74, is a candidate for clinical
development in animal models of inflammation, pain,
migraine, and asthma [158].

Fig. (24). Lanepitant disclosed by Eli Lilly.

RP-67,580 75 (Fig. (25)) emerged after lead optimization
of an initial screening hit from Rhone-Poulenc Rorer's
compound stock. RP-67,580 75 belongs to a class of
substituted perhydroisoindoles which, apart from poor oral
bioavailability, also suffered from L-type calcium channel
interaction [151,181]. The follow-up compound RPR-
100,893 76 (Fig. (25)), Dapitant, exhibits superior binding
affinity (IC 50 =1 3 nM, IM-9 cells) [182].

Fig. (25). Perhydroisoindole-type NK 1 antagonists.

Investigations of the axially chiral 1,7-naphthyridine-6-
carboxamide 77 (Fig. (26)) revealed that the atropisomer
(aR)-trans-77 represents the bioactive receptor-bound
conformation of this potent NK 1 antagonist [183]. This
analogue exhibits in vitro antagonistic activity for the
inhibition of [125I]Bolton-Hunter(BH)-SP binding in human
lymphoblast cells (IM-9) with an IC 50 value of 0.24 nM.
Further, it shows in vivo potency by inhibiting capsaicin-
induced plasma extravasation in the trachea of guinea pigs
upon i.v. and p.o. administration.

Based on this template, Natsugari et al. [183] developed
TAK-637 78 (Fig. (26)), the (aR,9R)-atropisomer of a cyclic
naphthyridine analogue. TAK-637 78 exhibits an IC 50 value
of 0.45 nM and an ID 50 of 4.3 µg/kg and 33 µg/kg after i.v. and
p.o. administration, respectively. Further, it increased the
shutdown time of distension-induced bladder contractions
and the bladder volume threshold in guinea pigs, thus
implying its clinical potential in the treatment of pollakiuria
and urinary incontinence [183]. The x-ray structures of 77
and 78 provide insights into the prerequisite structural
requirements for NK 1 receptor binding, thereby assigning the
(aR,9R)-isomer as the active conformation [183].

Fig. (26). Naphthyridine-type NK 1 antagonists.

Fig. (27). Dual NK1/NK2 antagonists.

Fig. (28). NK2 antagonists.

Dual NK 1 /NK 2 Antagonists

Since the release of SP and NKA causes mucus secretion,
airway constriction, and plasma extravasation - typical
clinical symptoms of asthma - it has been suggested to use
dual NK 1 /NK 2 antagonists in the treatment of asthma [184].

Considering the structural requirements of Sanofi's NK 2 -
selective antagonist SR-48,968 82 (Fig. (28), see below),
researchers at Yamanouchi Pharm. developed the
spiro[isobenzofuran]piperidine YM-35375 70 (Fig. (23))
with binding affinity towards the NK 2 receptor with an IC 50
value of 84 nM and an IC 50 value of 710 nM for NK 1 ,
respectively. Further, it shows inhibitory activity (ID 50 =4 1
µg/kg, i.v.) against [β-Ala8]NKA(4-10)-induced
bronchoconstriction in guinea pigs [185]. Utilizing this new
NK 1 /NK 2 dual antagonist as lead compound, a further spiro-
substituted piperidine analogue, YM-44778 79 (Fig. (27)),
was developed, exhibiting potent antagonistic activities
against the NK 1 (IC 50 =82 nM) and NK 2 (IC 50 =62 nM)
receptors in isolated tissues [185], respectively.

Based on L-tryptophan benzyl esters, Qi et al. reported
on the synthesis of two compounds, 80 and 81, with dual
NK 1 /NK 2 receptor affinity (Fig. (27)) [186].

80 contains a 4-spiroindano piperidine and shows dual
NK activity combined with slightly improved NK 2 activity
(IC 50 =56 nM (hNK 1 ), IC 50 =27 nM (hNK 2 )). Upon
incorporation of a 4-spiroindoline sulfonamide, the balanced
antagonist 81 was obtained (IC 50 = 14 nM - NK 1 ; 24 nM -
NK 2 ).

NK2 antagonists are of particular interest for the treatment
of chronic diseases such as asthma, inflammatory bowel
disorders, rheumatoid arthritis, pain, emesis, and psychiatric
disorders [157].

The first NK2 antagonist, SR-48,968 82 (Fig. (28)),
Saredutant, was described in 1992 [187]. This potent
antagonist has been shown to inhibit NKA-induced
bronchoconstriction in isolated human airways. Only recently,
a study by van Schoor et al. demonstrated that NKA-
induced bronchoconstriction in asthmatics was significantly
reduced with 100 mg Saredutant administered p.o. [188].

Based on this prototype compound, a number of
analogues emerged from different laboratories. SR-144,190
83 (Fig. (28)) retains the phenylpiperidine moiety but
contains an additional morpholine unit in order to introduce
rigidity. Compared to the parent compound, it exhibits a
similar pharmacological profile with increased bioavailability
in the CNS [189].

Yamanouchi (YM-38336 84) and Zeneca (ZD-7944 85)
(Fig. (28)) also presented potent NK2 antagonists based on the
Sanofi lead structure (SR-48,968 82). ZD-7944 85 [190],
showing a Ki value of 0.14 nM (MEL cells), still retains the
phenylpiperidine entity, while YM-38336 84 [191] has been
modified by introduction of a spiro-benzothiophene residue
in position 4 of the piperidine. YM-38336 84 shows potent
NK2 inhibitory activity against [β-Ala8]NKA(4-10)-induced
bronchoconstriction in guinea pigs, demonstrated by an ID50
value of 20 mg/kg, i.v. [192].

Harrison et al. reported on the development of selective
NK2 and NK3 antagonists based on a common structural
template, notably the NK3-selective compound SR-142,801
91 (Fig. (29), see below) [193]. Transfer of the carbonyl
oxygen from an exocyclic to an endocyclic position on the
piperidine ring led to two series of selective analogues, NK2
and NK3 antagonists, respectively [193]. An example of a
potent NK2 antagonist is compound 86 (Fig. (28)),
which exhibits an IC50 value of 2.2 nM for the displacement
of [125I]NKA from the cloned human NK2 receptor in CHO
cells.

Fig. (29). NK3 antagonists.

A number of preclinical nonpeptide NK2 antagonists have
been reported by Glaxo Wellcome, Rhône-Poulenc Rorer, and
Zeneca, e.g. GR-159,897 87, RPR-106,145 88 (related to
the NK1 antagonist RPR-100,893 76 (Fig. (25))), and ZM-
253,270 89 (Fig. (28)) [158], respectively.

Menarini used a notably rigid template for its
selective NK2 antagonist (Ki = 2.5 nM) MEN-11420 90,
Nepadutant, which exhibits improved in vivo potency and
duration of action, attributed to its rigid structure [194].

NK3 Antagonists

The first selective nonpeptide NK3 antagonist, SR- 
142,801 91, Osanetant, has been reported by Sanofi 
(Ki=0.21 nM, CHO cells) (Fig. (29)) [195]. 

Based on this structural template, Merck Sharp and 
Dohme elaborated a series of NK2 and NK3 antagonists, 
exemplified with analogue 92 (Fig. (29)), the corresponding 
congener of 86 (Fig. (28)).

SmithKline Beecham claimed NK3 antagonists for the
treatment of CNS diseases, pulmonary disorders and
dermatitis [196]. Based on a quinoline core structure,
Giardina et al. developed SB-223,412 93 (Fig. (29)),
demonstrating high NK3 activity (IC50 = 1.2 nM, Ki = 1.0
nM, CHO cells), weak NK2 activity, and no affinity for other
receptors including ion channels [197]. SB-223,412 93
exhibits in vitro and in vivo oral and intravenous activity in
animal models [198].

An entirely novel structure, 94 (Fig. (29)), has been
claimed as NK3 antagonist for the treatment of bronchitis,
asthma, anxiety, Parkinson's disease and dermatitis [199].
Interestingly, this compound strongly resembles the indane
carboxylic acids of SmithKline Beecham's ET antagonists.

Fig. (30). Miscellaneous Y1 antagonists.

Neuropeptide Y

Biomedical Significance 

The 36-amino acid peptide neuropeptide Y (NPY, Table
1) was discovered in 1982 by Tatemoto et al. [200]. NPY is
a member of the pancreatic polypeptide family, which also
includes the structurally related peptide YY (PYY) and
pancreatic peptide (PP) [201]. NPY is widely distributed
throughout the mammalian central and peripheral nervous
system [202,203]. Interacting with at least six receptor
subtypes (Y1-Y6), it is involved in numerous physiological
functions, e.g. food intake, blood pressure regulation,
hormone secretion, sexual behaviour, and circadian rhythm
[204-209]. Patent literature issued over the last ten years
concentrates mainly on the inhibition of receptor-ligand
interactions by low-molecular weight compounds in order to
therapeutically interfere in mechanisms such as anxiety,
appetite stimulation, obesity, alcohol intake, hypertension,
and regulation of coronary tone [210]. As the Y1 and Y5
receptors are suggested to control feeding behaviour, they are
believed to be the best target systems for developing
antagonists as therapeutics for the treatment of obesity
[204,211-213]. The Y1 receptor, found in the peripheral and
in the central nervous system (CNS), was cloned in
1992 [214]. Its modulation may influence numerous
physiological conditions including anxiety, diabetes,
obesity, or appetite disorders. Most recently, the Y5 receptor
has been cloned and characterized as being involved in food
intake regulation [212]. A review published by Ling in 1999
reports on the patent situation related to NPY antagonists
[210]. In this contribution, representative examples of
potentially active nonpeptide NPY antagonists will be
described according to their target receptors.

Y1 Receptor Antagonists

A number of Y1 antagonists (Fig. (30)) published over
the last ten years show binding affinities in the nanomolar
range, e.g. BIBP3226 95 (Ki = 7.2 nM), SR120819 96
(Ki = 15 nM), PD160170 97 (Ki = 48 nM), and LY-357897 98
(Ki = 0.75 nM) (Fig. (30)) [215-218]. The best characterized
Y1 antagonist, BIBP3226 95, has been demonstrated to
inhibit NPY-mediated vasoconstriction and pressure
variations [215]. SR120819 96 represents a dipeptide
analogue containing a sulfonamide. This orally active
antagonist, which incorporates a central arginine mimic
(benzamidine in 96), is potent in its 1,4-cis-
disubstituted cyclohexyl form and antagonizes NPY-
mediated pressure responses [219].

Parke-Davis discovered a new and unique class of
moderately potent but selective Y1 antagonists by random
screening, of which PD160170 97 is a representative
compound. Eli Lilly described LY-357897 98 from a series
of trisubstituted indoles and benzimidazoles. Compound 99
(Fig. (30)) [220], showing a Ki value of 2.1 µM, was
discovered by a biased screening of the in-house library and
served as lead structure in the subsequent SAR studies of the
trisubstituted indole series. Subsequent structure
modification led to 98, the most active analogue (Ki = 0.75
nM), which, in the (S)-configuration, inhibits NPY-induced
forskolin-stimulated cAMP release and intracellular Ca2+
release in the nanomolar range. The corresponding
benzimidazole series has also been investigated [221]. A
representative example is compound 100 (Fig.
(30)), which was obtained after systematic optimization of the
N1- and C4-substituents of the benzimidazole scaffold.
Compound 100 exhibits in vitro binding affinity on AV-12
cells expressing the human Y1 receptor with a Ki value of
1.7 nM.

Fig. (31). Benzazepinone-type Y1 antagonists.

Pfizer claimed a series of piperazinyl-containing
compounds as Y1-selective antagonists [222]. Analogue 101
(Fig. (30)) demonstrates an interesting activity profile in
that its two isomers behave differently, i.e. those with the
ethyl substituent cis (IC50 = 76 nM) and trans (IC50 = 525 nM)
with respect to the phenylpiperazine
substituent of the cyclohexyl ring.

Warner Lambert filed compounds based on a quinoline
scaffold that were claimed as Y1 subtype-selective
antagonists. The 6-aryl-sulfonyl-quinoline analogue 102
(Fig. (30)) inhibits [125I]PYY binding to the human Y1
receptor with a Ki value of 48 nM [223].

Alanex Corp. claimed two series of compounds containing
either an amidino-urea or a diamidino-urea core structure. A
representative of the latter series is 103 (Fig. (30)),
which inhibits the binding of [125I]PYY to the Y1 receptor in
membranes derived from human neuroblastoma cell lines
(SK-N-MC) with an IC50 value of 70 nM [224].

Bristol Myers Squibb's patents cover two structurally
related compound classes, i.e. phenyl-dihydropyridines [225] 
and phenyl-dihydropyrimidines [226]. In compound 104 
(Fig. (30)) the m-substituted phenyl-dihydropyridine 
sidechain is terminated with a spiroindane, a structural 
element which is also found among other antagonists 
directed against numerous members of the peptide-binding 
GPCR superfamily. 

Murakami et al. [227] at Shionogi published a novel
class of 1,3-disubstituted benzazepinones as potent and
selective Y1 antagonists. Based on the lead compound 105
(Fig. (31)) (Ki = 1.5 µM), which emerged from a random
screening approach, follow-up compounds 106 (Ki = 160 nM)
and 107 (Ki = 39 nM) have been obtained (Fig. (31)).

Fig. (32). Y5 antagonists.

Further optimization of the phenyl substituent in
position 3 leading to analogue 108 as well as optimization
of the substituent in position 3 of the 2,3,4,5-tetrahydro-1H-
1-benzazepin-2-one, represented by congener 109 (Fig. (31)),
improved the binding affinity to 43 nM
and 2.9 nM, respectively. Combination of the optimized
structural features led to one of the most potent derivatives
(110, Fig. (31)), which competitively inhibits specific
[125I]PYY binding to Y1 receptors in human SK-N-MC cells
with a Ki value of 5.1 nM. Although 110 also antagonizes
the Y1 receptor-mediated increase in cytosolic free Ca2+
concentration in SK-N-MC cells, it has not been evaluated
in vivo because of its poor solubility in aqueous solution and
poor oral bioavailability. Nevertheless, binding assays with
17 receptors, including the Y2, Y4, and Y5 receptors,
showed that it binds selectively to the Y1 receptor [227].
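
The progress of the benzazepinone optimization is easier to follow on a logarithmic scale. The sketch below converts the Ki values quoted above for 105, 107, 108, 109 and 110 into pKi (the negative decadic logarithm of Ki in mol/L) and reports the fold change relative to the screening lead 105. The helper names are illustrative assumptions, and the numbers are only those stated in the text.

```python
import math


def pki(ki_molar: float) -> float:
    """Return pKi = -log10(Ki), with Ki given in mol/L."""
    if ki_molar <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(ki_molar)


def fold_improvement(ki_lead_molar: float, ki_analogue_molar: float) -> float:
    """Return how many times lower the analogue's Ki is compared with the lead."""
    return ki_lead_molar / ki_analogue_molar


# Ki values as quoted in the text, converted to mol/L.
series = {
    "105": 1.5e-6,
    "107": 39e-9,
    "108": 43e-9,
    "109": 2.9e-9,
    "110": 5.1e-9,
}

lead = series["105"]
for compound, ki in series.items():
    print(f"{compound}: pKi = {pki(ki):.2f}, "
          f"fold vs 105 = {fold_improvement(lead, ki):.0f}")
```

On this scale, the step from 105 to 110 corresponds to a gain of roughly 2.5 log units in binding affinity.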

Y5 Receptor Antagonists

Several patent applications were filed by Novartis in
1997 [228-230] claiming diamino quinazolines as selective
Y5 antagonists. They were shown to inhibit the NPY-induced
Ca2+ increase in stably transfected cells expressing the Y5
receptor. Analogue 111 (Fig. (32)) decreases food intake by
60% in 24 h food-deprived rats after i.p. administration of 30
mg/kg.

In 1998 Banyu Pharm. [231,232] and Bayer [233] filed
patents including aminopyrazoles, aminopyridines and an
amide-based core structure as Y5 antagonists. The Banyu
compounds 112 and 113 showed IC50 values for Y5 binding
of 8.3 nM and 4.1 nM, respectively [234], whereas the Bayer
compound 114 binds with an IC50 value of 0.47 nM. This
congener also shows selective affinity for the Y5 receptor
compared to the Y1, Y2, or Y4 receptor subtypes (Fig. (32))
[234].

STRUCTURAL-BASED DRUG DESIGN

After having addressed the classical lead finding approach 
characterized by screening compound libraries with 
subsequent optimization, the complementary strategy of 
structure-based design will be highlighted, since this 
strategy is about to change the classical paradigm of "random 
versus rational" in favour of "random goes rational". Due to 
the fact that no high-resolution structure of any GPCR 
protein is available, all design attempts are still restricted to
comparative analyses of structural features of biologically 
characterized low-molecular weight compounds which are 
interpreted in terms of steric and physicochemical 
complementarity to a hypothetical receptor binding site. 
Currently pursued GPCR research projects represent 
textbook examples for the fruitful combination of ligand- 
derived rationales that are incorporated into e.g. the design of 
combinatorial chemistry programs with the aim to direct 
resulting libraries more efficiently to the target class of 
interest, rather than attempting to explore systematically the 
infinite universe of molecular diversity. In the following, a 
few representative research efforts will be introduced that 
clearly attempt to change the mainstream of classical lead 
finding programs in favour of knowledge-based approaches. 



Somatostatin 

Somatostatin (Somatotropin Release-Inhibiting Factor, 
SRIF) (Table 1) was discovered because of its inhibitory 
effect on growth hormone secretion. The peptide hormone 
which exists in two biologically active forms, the 14 amino 
acid form (SRIF-14) and the 28 amino acid form (SRIF-28), 
acts as a neuromodulator [235]. 



Five receptor subtypes for somatostatin (sst1-sst5) have
been cloned and characterized from human tissue [236]. 
Apart from its pivotal role as neuromodulator within the 
central nervous system (CNS), somatostatin alters the 
secretion of growth hormone (GH), insulin, glucagon, 
pancreatic enzymes, and gastric acid [237-240]. 
Consequently, analogues of somatostatin emerged as 
interesting tools in the treatment of disorders linked to the 
above mentioned physiological functions. Somatostatin 
agonists may therefore be used for the treatment of 
acromegaly, diabetes, cancer, rheumatoid arthritis, and 
Alzheimer's disease. Especially sst2-selective agonists 
emerged as useful candidates for the treatment of acromegaly, 
retinopathy, and diabetes [241,242]. 

The area of somatostatin agonist and antagonist research
is a textbook example of indirect drug design utilizing
ligand-derived structural rationales for design purposes. At
the beginning of the 1990s numerous design projects
aimed to replace the peptide scaffold of the
pharmacophore portion of somatostatin (SRIF-14), yielding
a variety of moderately active, chemically diverse
compounds. More recent lead finding programs employ the
highly efficient technology of combinatorial chemistry for
rapid modification of promising hits, culminating in subtype-
selective high-affinity binding compounds from a series of
designed libraries. A brief overview of both the rational
design of single somatostatin-based peptidomimetics and
the combinatorial chemistry-based approaches for lead
identification and optimization will be given after a short
description of the somatostatin-relevant pharmacophore
hypothesis.

The tetradecapeptide SRIF-14 115 (Fig. (33)), one of the
widely distributed active forms of somatostatin, is believed
to adopt a two-stranded β-sheet conformation induced by a β-
turn encompassing Phe7-Trp8-Lys9-Thr10 and the disulfide
bridge between Cys3 and Cys14 (Fig. (33)).
The conformation is further stabilized by the transannular H-
bonding pattern typical for antiparallel sheet structures. From
numerous sequence- and structure-activity studies it turned
out that the primary pharmacophore consists of the β-turn-
forming residues Phe7-Trp8-Lys9 and an additional
lipophilic binding element reminiscent of Phe6/Phe11 [243].

Fig. (34). Peptide conformation-derived non-peptide somatostatin antagonists. The numbering scheme refers to that of SRIF-14 (see
Fig. (33)).

Fig. (35). High-affinity hsst2 antagonists derived by screening and subsequent optimization.




The experimentally derived conformations of the
metabolically more stable peptide analogues, e.g. octreotide
(Sandostatin®) 116 [244,245] or L-363,377 117 [246,247],
not only support the pharmacophore hypothesis, but were
further used as template structures underlying a series of
rational design attempts (Fig. (33)). In 1992, researchers at
Sandoz designed a tetra-substituted xylofuranose derivative
118 (Fig. (34)) positioning the sidechains of Phe7-Trp8-Lys9
at its C-2, C-3, and C-5 atoms, while the benzyloxy group
attached to C-3 resembles the aromatic sidechain of Phe11
(Fig. (34)) [248].

The xylose derivative 118 displaced radio-labelled
octreotide 116 from its receptor with an IC50 of 23 µM.
Even though the mutual steric fit of the xylose-based mimic
and the somatostatin structure was reasonable, the
compounds displayed only moderate affinity, which was
attributed to the loss of considerable conformational entropy
during receptor binding. Consequently, the design strategy
at Sandoz was directed towards more rigid compounds based
on nonpeptide scaffolds. For the purpose of substituting the
peptide backbone of SRIF-14 within the β-turn portion, the
privileged structure of the 1,4-benzodiazepinone was
employed, from which the pharmacophoric groups could
radiate into the periphery [249]. The resulting nonpeptide
tetrapeptide mimetic 119 (Fig. (34)) was designed to account
for the sidechains of Phe7-Trp8-Lys9 by the appropriate
substituents, while the aromatic ring of the benzodiazepine
core was believed to mimic the additional lipophilic element
referring to Phe6/Phe11. However, the racemic
mixture of 119 (benzodiazepine) showed an IC50 of 7 µM,
and even after separation, the L- and D-Trp-containing
benzodiazepines displaced the radioligand with IC50 values of
only 6.5 µM and 8.2 µM, respectively.

Fig. (36). Side-by-side stereo presentation of the structural overlay of 123 (ball-and-stick mode) onto the experimentally
derived conformation of 117 (stick mode).

Similar affinities in the low micromolar range were
obtained with peptidomimetics based on β-D-glucose
scaffolding described by Hirschmann and Nicolaou at the end
of the 1980s and beginning of the 1990s [250]. Molecular
modeling studies carried out on the 3D structures of SRIF-
14 115 and analogues of L-363,377 117 suggested that
substituents at C-2, C-1, and C-6 of a β-D-glucose template
resemble the orientational pattern of the β-turn-forming
amino acids of the somatostatin-derived peptides. The
corresponding penta-substituted glucose 120 (Fig. (34))
showed an IC50 of 15 µM.

In 1996, researchers from Rhône-Poulenc Rorer published
a similar approach of de novo designed peptidomimetics
employing aza-sugar-based templates for the spatially
controlled orientation of the pharmacophoric amino acid
sidechains [251]. Independent of ring size and substitution
pattern, all analogues showed weak affinity with IC50 values
in the range of 10-15 µM (see for example 121, Fig. (34)).

Fig. (37). For each somatostatin receptor subtype (hsst1-hsst5) highly selective compounds emerged from rationally designed
combinatorial libraries. Structure labels recovered from the figure: 127, L-803,087, Ki = 0.7 nM, hsst4-selective; 128, L-817,818, Ki = 0.4 nM, hsst5-selective.

Over the last two years, scientists at the Merck Research
Laboratories conducted a comprehensive program aimed at
identifying subtype-selective peptidomimetic compounds for
each somatostatin receptor subtype (sst1-sst5) by following a
rational design strategy using a combination of classical
medicinal chemistry with modern combinatorial chemistry
techniques [252-256]. The primary lead, L-264,930 122
(Fig. (35)), that initiated that combined approach, was
identified by a virtual screening of the Merck sample
collection. The 3D structure of the cyclic hexapeptide L-
363,377 117 (Fig. (33)) served as spatial probe in that a
geometric pattern, describing the arrangement of the
pharmacophoric groups, was derived by means of molecular
modeling. After similarity searches, in which the sidechains
of residues Tyr7-Trp8-Lys9 were given priority for the
pharmacophore definition, L-264,930 122 was uncovered
with submicromolar affinity for the hsst2 receptor.

This compound became the primary focus for medicinal
chemistry and combinatorial chemistry at Merck. By
constraining the floppy diamine chain with a 1,3-bis-
aminomethyl-cyclohexane moiety, the compound was
optimized to yield L-054,264 123 (Fig. (35), Fig. (36)) with
an IC50 of 1.6 nM for the hsst2 receptor and a more than
1000-fold selectivity over all other somatostatin receptor
subtypes.

Simultaneously, L-264,930 122 served as lead structure
for a targeted combinatorial library. For library design the
lead was dissected into three components, notably the central
α-amino acid, the C-terminal blocking diamine, and the N-
terminal blocking bulky urea-attached amine. The initial
library was based on 20 α-amino acids that were mainly
analogues of Trp or carried modified aromatic sidechains.
Additionally, 20 diamines were chosen in which the spacing
between the two nitrogens varies between four and six
atoms, also encompassing different ring topologies. The
amine collection comprised 79 different entities that were
biased towards piperidines and piperazines containing
additional aromatic rings, so-called "privileged structures".
A solid-phase mix-and-split protocol was used to synthesize
more than 130,000 compounds in complex mixtures that
demanded a deconvolution strategy. After several rounds of
iterative optimization employing classical analoging as well
as follow-up libraries, five compounds 124-128 emerged
with the desired activity and selectivity profile, in that each
compound is highly selective for a distinct somatostatin
receptor subtype (Fig. (37)).
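
The size of a three-component library of this kind grows as the product of the building-block counts. The following sketch is an illustration only: the function names and the toy building-block labels are hypothetical, and it counts just the three set sizes quoted above (20 α-amino acids, 20 diamines, 79 amines). That product is 31,600, which is smaller than the total of more than 130,000 compounds stated in the text. The text does not itemize what accounts for the difference, so the sketch makes no attempt to reproduce it.

```python
from itertools import product
from math import prod


def library_size(building_block_counts):
    """Number of distinct combinations when one block is chosen from each set."""
    return prod(building_block_counts)


def enumerate_library(amino_acids, diamines, amines):
    """Yield every (amine, amino acid, diamine) combination of the three sets."""
    for amino_acid, diamine, amine in product(amino_acids, diamines, amines):
        yield (amine, amino_acid, diamine)


# Set sizes as quoted in the text.
counts = {"alpha-amino acids": 20, "diamines": 20, "capping amines": 79}
print(library_size(counts.values()))

# Toy enumeration with placeholder labels (not real building blocks).
toy_library = list(enumerate_library(["AA1", "AA2"], ["DA1", "DA2", "DA3"], ["AM1"]))
print(len(toy_library), toy_library[0])
```

Because the compounds are produced as mixtures in a mix-and-split synthesis, the active members have to be identified afterwards, which is why the text mentions a deconvolution strategy.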

This program impressively demonstrates the impact of an 
intelligent combination of structural rationales derived by 
comprehensive molecular modeling with the synthetic 
efficiency of current combinatorial chemistry techniques for 
lead finding attempts within modern medicinal chemistry. 

A further example of a peptidomimetics-based library
employing structure rationales for identification of subtype-
selective somatostatin analogues was published recently by
J. Ellman and co-workers (Fig. (38)) [257]. By decoration of
a medium-sized heterocyclic β-turn mimic with the Trp and
Lys sidechains in positions i+1 and i+2 and vice versa, together
with an additional amine building block in i+3, a
remarkably small library of only 172 entities (22 amines,
D/L-Trp-D/L-Lys, D/L-Lys-D/L-Trp) uncovered an hsst5-
selective compound 129 with an IC50 of 87 nM.



Bradykinin 

Researchers at Sterling Winthrop considered angiotensin-
converting enzyme (ACE) inhibitors as templates for the
design of BK B2 receptor antagonists [258], since ACE
degrades both angiotensin II (AII) and BK by cleaving the
Pro7-Phe8 amide bond. Therefore, an ACE inhibitor was
considered to display properties or conformational
similarities to BK, thus establishing a pharmacophoric link
between ACE and BK receptors in that both macromolecules
recognize similar steric and physicochemical features. In
order to test this hypothesis, the ACE inhibitor Quinapril
130 (Fig. (39)) [259] was chosen as template for the design
and synthesis of a series of homoPhe-Tic (Tic:
tetrahydroisoquinoline) containing compounds. The
diastereomers of 131 (Fig. (39)) exhibit binding affinities in
the micromolar range (Ki = 1 µM) in [3H]BK binding
studies with human IMR-90 fetal lung fibroblasts.

Fig. (38). Left: hsst5-selective compound 129 derived from a β-turn-templated library; right: side-by-side stereo presentation of the
structural superposition of the β-turn mimic (ball-and-stick mode) onto the βII' turn portion of 117 (stick mode).

1636 Current Medicinal Chemistry, 2001, Vol. 8, No. 13




Fig. (39). Quinapril (130) served as template for the design of
BK antagonists (e.g. 131).

Goodfellow et al. [260] followed a different approach in
that they established a library based on a β-turn template,
CP-0597 132 (Fig. (40)) [261], which is a peptidic B1/B2
antagonist containing D-Tic and N-Chg (Chg: N-
cyclohexylglycine) in the i+1 and i+2 positions of a βII' turn.
Starting from that structural rationale, the peptidomimetic
CP-2055 133 (Fig. (40)) was generated. Based on the 1,4-
piperazine scaffold, a combinatorial library was designed
to produce approximately 2500 rationally directed diverse
analogues (RDDA), 134 (Fig. (40)).

This process led to the discovery of nonpeptide B2
antagonists serving as lead compounds for traditional
optimization. While the parent peptidic analogue CP-0597
132 shows an IC50 value of 0.33 nM, CP-2055 133 exhibits
an IC50 value of about 55 µM on a cloned human B2
receptor. CP-2458 is a further member of the designed
library 134 and inhibits human B2 receptor binding
(IC50 = 4.1 µM) and BK-stimulated Ca2+ flux in human
fibroblasts (IC50 = 19 µM). Unfortunately, the chemical
formula of the compound is not given explicitly in the
publication.

Based on two structural templates, (i) a cyclic hexapeptide
BK antagonist 135 [262] and (ii) the nonpeptide BK
antagonist WIN-64338 43 (Fig. (41)) [129], Dankwardt et
al. [263] designed nonpeptide B2 antagonists. While the
hexapeptide served as structural template for the positioning
of relevant functionality, WIN-64338 43 served as rigid
scaffold for the design of a series of naphthylalanine-
containing derivatives, none of which showed improved
affinity for the B2 receptor when compared to WIN-64338 43
(Ki = 44 nM). Substitution of the phosphonium group
by the corresponding ammonium moiety resulted in a
two-fold decrease in affinity for the B2 receptor. However, the
proposed structural superposition of the cyclic hexapeptide
135 with the blocked amino acid derivative 43 provided a
pharmacophore hypothesis that enabled Dankwardt and
coworkers to design moderately active compounds and
might serve as structural blueprint for further design
attempts [263].



Neurokinin 

The structural feature of a reverse β-turn has emerged as a
general design principle underlying a variety of GPCR
antagonist projects. β-turns play an important role in
recognition phenomena as documented e.g. for somatostatin
and NKA, which bind to their receptors in a proposed β-turn
conformation. Therefore Horwell et al. [264] at Parke-Davis
decided to employ β-turn mimetics for the design of
compounds with affinity for the NK2 receptor. Starting from
the x-ray structure of MEN-10627 136 (Fig. (42)) [265], a
cyclic hexapeptide displaying high NK2 affinity, a
pyrrolidine-based Trp-Phe dipeptide mimetic 137 has been
designed (Fig. (42)).

Fig. (40). Design strategy of BK antagonists following the "rationally directed diverse analogues" approach.

The Trp-Phe dipeptide scaffold mimics the Trp-Phe
fragment in the central portion (i+1, i+2) of a βI turn within
the cyclic hexapeptide, which folds into a βI/βII turn
conformation. Although the indole and benzyl sidechains of
both compounds superimpose satisfactorily, 137 did not show
significant NK2 receptor affinity. The lack of affinity has been
attributed to the misfit of the dipole moments of both
molecules. In order to address this problem in more detail, a
further Trp-Phe dipeptide mimetic 138 (Fig. (42)) has been
designed by computer-assisted molecular modeling, which
identified a 2-azabicyclonorbornane spacer as more
favourable than the pyrrolidine (Fig. (43)).



Comparison of the binding affinities revealed that the
conversion of the hexapeptide to a dipeptide unit results in
the loss of high binding affinity (MEN-10627 136:
IC50 = 0.079 nM (NK2); 137: 14% inhibition at 10 µM (NK2);
138: 31% inhibition at 10 µM (NK2)), as studied by displacement
assays with [125I]NKA in hamster urinary bladder. On the
other hand, the [125I]BH-SP displacement from NK1 in human
IM-9 cells shown by MEN-10627 136 (IC50 = 0.8 µM) is retained by
137 and 138 with IC50 values of 3.7 µM and 6.7 µM,
respectively. Interestingly, the dipeptide mimetics exhibit
some binding affinity to human NK3 receptors stably
expressed in CHO cells, as shown by displacement of [125I]-
[MePhe7]NKB (137: IC50 (NK3) = 3.5 µM; 138:
35% inhibition at 10 µM (NK3)), while the parent hexapeptide
exhibited no NK3 affinity at all.
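
Several of the values above are reported as percent inhibition at a single concentration rather than as IC50 values. A rough way to put both kinds of number on one scale is to assume simple one-site competitive displacement with a Hill slope of 1, where the fraction inhibited is f = C / (C + IC50) and therefore IC50 = C (1 - f) / f. The sketch below applies this relation. It is an order-of-magnitude estimate under that stated assumption only, it is not a value reported in the source, and the function name is illustrative.

```python
def estimate_ic50_um(percent_inhibition: float, conc_um: float) -> float:
    """Estimate IC50 (µM) from one percent-inhibition value.

    Assumes one-site displacement with a Hill slope of 1:
    f = C / (C + IC50)  =>  IC50 = C * (1 - f) / f.
    """
    if not 0.0 < percent_inhibition < 100.0:
        raise ValueError("percent inhibition must lie strictly between 0 and 100")
    if conc_um <= 0:
        raise ValueError("concentration must be positive")
    return conc_um * (100.0 - percent_inhibition) / percent_inhibition


# Single-point data quoted in the text, all measured at 10 µM.
single_point = {
    "137 (NK2)": 14.0,
    "138 (NK2)": 31.0,
    "138 (NK3)": 35.0,
}

for label, inhibition in single_point.items():
    estimate = estimate_ic50_um(inhibition, 10.0)
    print(f"{label}: estimated IC50 ~ {estimate:.0f} µM (Hill slope 1 assumed)")
```

Estimates derived from low inhibition values are particularly sensitive to experimental scatter, so they are best read as indicating that the IC50 lies well above the tested concentration.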

Only recently, Porcelli et al. [266] presented the design
of an SP antagonist based on a cyclic pentapeptide with the
chirality sequence following a D1L2D3D4L5 pattern. The
authors suggest this scaffold as a generic template to design
antagonists also for other members of the GPCR family.
This suggestion is the logical consequence of the fact that
the same unique skeleton
is found among other antagonists for
peptide-binding GPCRs, e.g. the natural pentapeptide BE-
18257B (cyclo-(D-allo-Ile-Leu-D-Trp-D-Glu-Ala-)) and its
synthetic analogue BQ-123 (cyclo-(D-Val-Leu-D-Trp-D-Asp-
Pro-)) [267], a prominent ETA antagonist. Both cyclic
pentapeptides follow the chiral sequence pattern DLDDL.
The solution structure of BQ-123 [268] exhibits a typical
βII/γ turn arrangement characteristic for this class of
molecules. Based on the same structural template, Porcelli et
al. designed an SP antagonist, ITF-1565 (cyclo-(D-Trp1-Pro2-
D-Lys3-D-Trp4-Phe5-)), which inhibits NK1-mediated SP-
induced contraction of the rabbit caval vein. ITF-1565
shows only modest NK2 activity and was inactive in ETA assays.
ITF-1565 exhibits a βII/γ turn arrangement with Pro2 in the i+1
and D-Lys3 in the i+2 position of the β-turn and Phe5 in the
central position of the γ-turn. Interestingly, the authors
succeeded in superimposing the sidechain functionalities of D-
Trp4, Phe5 and D-Trp1 within ITF-1565 well onto the
indole and benzyl rings within a β-D-glucose-derived SP
antagonist 139 (Fig. (44)).

Fig. (42). Peptide structure-derived rationales were used to design non-peptide NK antagonists (Dap: 2,3-diaminopropanoic acid).

Fig. (43). Side-by-side stereo presentation of the structural overlay of 138 (ball-and-stick mode) onto the x-ray structure of 136 (stick
mode) within the turn corresponding portion.

Fig. (44). Glucose-based peptidomimetic NK analogue 139.



Luteinizing Hormone-Releasing Hormone 

The decapeptide amide Luteinizing Hormone-Releasing
Hormone (LHRH, Table 1) [269], pGlu-His-Trp-Ser-Tyr-
Gly-Leu-Arg-Pro-Gly-NH2, is released from the
hypothalamus and stimulates the anterior pituitary gland,
resulting in the secretion of the gonadotropins luteinizing
hormone (LH) and follicle-stimulating hormone (FSH).
LHRH, also termed gonadotropin-releasing hormone, plays
an important role in the regulation of reproductive functions,
thus rendering its synthetic analogues useful tools for the
treatment of endocrine-based diseases like prostate and breast
cancer, endometriosis, uterine leiomyoma, and precocious
puberty [270]. Even though LHRH agonists proved to be
useful in the treatment of the above-mentioned diseases [271-
273], research has also focused on the development of potent
and safe antagonists.

Recently, Takeda presented a substituted 4-
oxothieno[2,3-b]pyridine as a highly potent and orally active
nonpeptide antagonist of the human LHRH receptor [274].
Again, this research program was based on the structural
characteristics of a β-turn suggested as the dominant
conformational feature within [5-8]LHRH (Fig. (45))
[272,273].

Fig. (45). β-turn-derived design strategy uncovered highly active non-peptide LHRH analogues.



The β turn is considered to represent the bioactive
conformation of LHRH in the receptor-bound state.
Therefore, an attempt was made to transfer the structural
element of a β turn onto a rigid scaffold which mimics the
β turn and can be decorated with the crucial functionalities,
thus positioning them into the receptor-complementary
orientation (Fig. (45)). For this purpose, a directed screening
approach was initiated with the aim of uncovering compounds
showing similarity to the turn template. Screening for
inhibition of the specific binding of
[125I]leuprorelin to human LHRH receptor [275] expressed in
CHO cells resulted in the initial lead compound 140
(Fig. (45)) [274].

This compound was structurally compared to the
hypothesized β turn arrangement and changed in order to
fulfil the structural requirement imposed by that template.

For example, substituting Gly by hydrophobic D-amino acids
increased activity, presumably due to stabilization of the β
turn by introducing a D-amino acid into the i+1 position of
the β turn. Subsequent modifications finally led to the
discovery of T-98475 141 (Fig. (45)) exhibiting an IC50
value of 0.2 nM for the binding to the cloned human LHRH
receptor. Further, T-98475 141 shows inhibitory effects on
LHRH-stimulated LH release in functional in vitro and in
vivo assays. Thus, T-98475 141 is a good candidate for a
new class of therapeutics for the treatment of LH-induced
dysfunctions in sex-hormone-dependent pathologies.



C5a 

[Figure: structures of compounds 142, 143, 144 and 145]
Fig. (46). C5a analogues.

The 74 amino acid peptide C5a (Table 1) is released after
activation of the complement system at sites of inflammation
by proteolytic cleavage of the complement factor C5 [276].
The hormone-like peptide anaphylatoxin, C5a, acts as
chemotaxin by attracting and promoting the degranulation of
granulocytes and macrophages during immune response
[277,278]. Inappropriate activation of C5a results in a
number of inflammatory diseases including rheumatoid
arthritis [279], Alzheimer's disease [280], ischemic heart
failure [281], psoriasis [282], atherosclerosis [283], and
adult respiratory distress syndrome (ARDS) [284]. In this
sense, agents preventing the interaction of C5a and its
receptor, C5aR, would be useful for inhibition of the
pro-inflammatory function of C5a, and thus useful
therapeutics in the treatment of chronic inflammatory
disorders induced by activation of the complement system
and the release of C5a [285,286]. The binding of the small
protein C5a to its receptor is characterized by two interaction
sites. A two-site model has been proposed localizing the
major binding epitope for the ligand C5a in the extracellular
N-terminal region of the receptor, while the second binding
cavity is located in the core of the transmembrane helix
bundle, obviously serving as the "activating binding site"
recognizing the C-terminal octapeptide of the ligand
[287,288]. Starting from the sequence of the native ligand, a
number of peptide-based antagonists were discovered which
have been reviewed only recently by Wong et al. [289].
Obviously, the development of a nonpeptide antagonist in
this field is a major challenge, since over the last two decades
research revealed only low molecular weight compounds
acting as C5a agonists or at least partial agonists.

Merck identified an initial lead 142 (Fig. (46)), which
served for further optimization, by screening an in-house
sample collection for the displacement of [125I]C5a from
human neutrophil membrane preparations [290].

The spiroindane-bearing hydantoin 142 has been
modified by introduction of a cyclohexylmethyl group
instead of the benzyl residue, resulting in compound 143
(Fig. (46)), which exhibits an IC50 value of 0.3 nM.

Surprisingly, functional receptor assays revealed that all
compounds of this series with affinity for C5aR showed an
agonistic potential. The only nonpeptide antagonists have
been reported by Merck, investigating 4,6-diaminoquinolines
(144) [291], and by Rhone-Poulenc Rorer, identifying a
phenylguanidine by random screening (145, IC50 = 0.8 µM)
(Fig. (46)) [292].

As random screening techniques have not brought the
expected success, rational design would offer an alternative in
the lead finding process for C5a antagonists. Based on the
results of conformational studies of cyclic pentapeptide ET
antagonists, BE-18257B and BQ-123 [293,294], Wong and
co-workers [295,296] followed the same strategy as presented
by Porcelli et al. [266] for the design of the SP antagonist,
ITF-1565. BQ-123, cyclo-(D-Val1-Leu2-D-Trp3-D-Asp4-
Pro5-), and ITF-1565, cyclo-(D-Trp1-Pro2-D-Lys3-D-Trp4-
Phe5-), follow an identical chirality pattern of D1-L2-D3-D4-L5
leading to a βII/γ(i) turn arrangement with L2-D3 in i+1 and
i+2 position of the β turn and L5 in the central position of
an (inverse) γ turn. The strategy seems also to be applicable
to C5a, since the C-terminal-derived C5a antagonist NMe-
Phe-Lys-Pro-D-Cha-Trp-D-Arg (Cha: cyclohexylalanine)
shows a well defined structure in solution in which the
lysine sidechain is in close proximity to the D-arginine
carboxylate. Ring closure resulted in a backbone-to-sidechain
cyclized peptide, cyclo-Ac-Phe-(Orn-Pro-D-Cha-Trp-D-Arg-)
(brackets indicate the sidechain-to-backbone mode of
cyclization, Orn-NHε-CO-D-Arg) with an IC50 value of 9.28
µM for the displacement of [125I]C5a from human
polymorphonuclear (PMN) cells. Conformational analysis
revealed a γ turn with Pro in the central position stabilized
by a hydrogen bond between the flanking amino acids,
Orn-CO···HN-D-Cha, together with a "pseudo" βII turn involving
D-Cha-Trp-D-Arg-Orn defined by a second hydrogen bond
between D-Cha-CO···HεN-Orn. This is consistent with
φi+1/ψi+1 and φi+2/ψi+2 dihedrals of Trp and D-Arg
(-58°/90°; 69°/-3°) confirming a β turn type II (ideal values:
-60°/120°; 80°/0°) arrangement [295]. More detailed SAR
studies showed that the L-Arg containing isomer is much 
more active than the D-Arg congener (lCso^O nM; 
inhibition of C5a-induced release of myeloperoxidase from 
PMNs). The NMR-derived solution structure reveals an
inverse γ turn (γi) involving D-Cha-Trp-Arg stabilized by a
hydrogen bond between D-Cha-CO···HN-Arg [296].
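
The comparison of the observed Trp and D-Arg dihedrals with the ideal type II β turn values quoted above can be made explicit with a short calculation. The following is only an illustrative sketch: the function names and the 30° tolerance are assumptions introduced here and are not taken from the cited conformational studies.

```python
# Illustrative sketch: deviation of observed backbone dihedrals from the
# ideal type II beta-turn values quoted in the text.
# The 30-degree tolerance is an assumed threshold, not a value from [295].

IDEAL_BETA_II = {"i+1": (-60.0, 120.0), "i+2": (80.0, 0.0)}   # (phi, psi) in degrees
OBSERVED = {"i+1": (-58.0, 90.0), "i+2": (69.0, -3.0)}        # Trp and D-Arg


def angular_difference(a_deg, b_deg):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def turn_deviations(observed, ideal):
    """Return {position: (delta_phi, delta_psi)} for each turn position."""
    deviations = {}
    for position, (phi, psi) in observed.items():
        ideal_phi, ideal_psi = ideal[position]
        deviations[position] = (
            angular_difference(phi, ideal_phi),
            angular_difference(psi, ideal_psi),
        )
    return deviations


def matches_turn(observed, ideal, tolerance_deg=30.0):
    """True if every dihedral lies within the (assumed) tolerance of the ideal."""
    return all(
        abs(delta) <= tolerance_deg
        for pair in turn_deviations(observed, ideal).values()
        for delta in pair
    )


if __name__ == "__main__":
    print(turn_deviations(OBSERVED, IDEAL_BETA_II))
    print(matches_turn(OBSERVED, IDEAL_BETA_II))
```

The largest deviation is in ψ of the i+1 residue (90° observed against 120° ideal), which sits exactly at the assumed tolerance.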



CONCLUSION 

This review was intended to highlight not only the 
relevance of the GPCR superfamily for drug development 
purposes during the last decade, but also the tremendous 
potential of that particular target class for future medicinal 
chemistry programs aimed to uncover new ligands for 
peptide-binding GPCRs. Especially the cross-fertilizing 
combination of ligand-derived structure rationales with the 
dramatically enhanced efficiency of automated synthesis and 
combinatorial chemistry will enable pharmaceutical research 
to identify new chemical entities more rapidly. Even though 
we have witnessed a technology-based quantum leap forward 
in efficiency within medicinal chemistry in the late 1990's, 
the vigorous search for novel GPCR genes within e.g. the 
human genome has far outpaced the identification of novel 
endogenous and exogenous ligands. The identification of 
these ligands remains one of the most challenging tasks in 
modern pharmacology. The number of GPCRs for which
endogenous or exogenous ligands are unknown today
continues to increase, thus offering modern pharmaceutical
research new opportunities in that entirely new drug targets
associated with innovative therapeutic principles emerge. In
this context, new low-molecular weight ligands for these
orphan receptors will undoubtedly lead to novel insights 
into the complexity of numerous poorly understood human 
disorders. Consequently, targeted medicinal chemistry 
approaches towards members of the GPCR family will 
facilitate the understanding of the precise physiological role 
of orphan receptors as well as produce new compounds as 
qualified lead structures for clinical development. 

Concluding, the field of GPCR research is clearly 
expected to grow dramatically due to the progress that will 
be made in the human genome initiative, demanding 
increased contributions from medicinal chemistry in order to 
provide new pharmacological tools as well as new leads for 
the development of new drugs. 
understanding human evolution, the causation
of disease, and the interplay between the
environment and heredity in defining the human
condition. A project with the goal of
determining the complete nucleotide sequence
of the human genome was first formally
proposed in 1985 (1). In subsequent
years, the idea met with mixed reactions in
the scientific community (2). However, in
1990 the Human Genome Project (HGP) was
officially initiated in the United States under
the direction of the National Institutes of
Health and the U.S. Department of Energy
with a 15-year, $3 billion plan for completing
the genome sequence. In 1998 we announced
our intention to build a unique genome-
sequencing facility, to determine the
sequence of the human genome over a 3-year
period. Here we report the penultimate milestone
along the path toward that goal, a nearly
complete sequence of the euchromatic portion
of the human genome. The sequencing
was performed by a whole-genome random
shotgun method with subsequent assembly of
the sequenced segments.

The modern history of DNA sequencing
began in 1977, when Sanger reported his method
for determining the order of nucleotides of
DNA using chain-terminating nucleotide analogs
(3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood
and co-workers (5) described an improvement
in the Sanger sequencing method that included
attaching fluorescent dyes to the nucleotides,
which permitted them to be sequentially read
by a computer. The first automated DNA sequencer,
developed by Applied Biosystems in
California in 1987, was shown to be successful
when the sequences of two genes were obtained
with this new technology (6). From early
sequencing of human genomic regions (7), it
became clear that cDNA sequences (which are
reverse-transcribed from RNA) would be essential
to annotate and validate gene predictions
in the human genome. These studies were the
basis in part for the development of the expressed
sequence tag (EST) method of gene
identification (8), which is a random selection,
very high throughput sequencing approach to
characterize cDNA libraries. The EST method
led to the rapid discovery and mapping of human
genes (9). The increasing numbers of human
EST sequences necessitated the development
of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at
The Institute for Genomic Research (TIGR), an
algorithm was developed that permitted assembly
and analysis of hundreds of thousands of
ESTs. This algorithm permitted characterization
and annotation of human genes on the basis
of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda
genome sequence was determined by a
shotgun restriction digest method in 1982
(11). When considering methods for sequencing
the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method
was discussed and subsequently rejected owing
to the lack of appropriate software tools
for genome assembly. However, in 1994,
when a microbial genome-sequencing project
was contemplated at TIGR, a whole-genome
shotgun sequencing approach was considered
possible with the TIGR EST assembly algorithm.
In 1995, the 1.8-Mbp Haemophilus
influenzae genome was completed by a
whole-genome shotgun sequencing method
(13). The experience with several subsequent
genome-sequencing efforts established the
broad applicability of this approach (14, 15).

A key feature of the sequencing approach
used for these megabase-size and larger genomes
was the use of paired-end sequences
(also called mate pairs), derived from subclone
libraries with distinct insert sizes and
cloning characteristics. Paired-end sequences
are sequences 500 to 600 bp in length from
both ends of double-stranded DNA clones of
described lengths. The success of using end
sequences from long segments (18 to 20 kbp)
of DNA cloned into bacteriophage lambda in
assembly of the microbial genomes led to the
suggestion (16) of an approach to simultaneously
map and sequence the human genome
by means of end sequences from 150-
kbp bacterial artificial chromosomes (BACs)
(17, 18). The end sequences spanned by
known distances provide long-range continuity
across the genome. A modification of the
BAC end-sequencing (BES) method was applied
successfully to complete chromosome 2
from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, when
less than 5% of the genome had been sequenced,
it was clear that the rate of progress
in human genome sequencing worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to under- 
take the sequencing of the human genome with 
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation 
of a genome-sequencing facility were established
in the TIGR facility (24). However, the
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic
changes in the public genome effort subsequent
to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach
to the human genome. We initially proposed
to do 10-fold sequence coverage of the
genome over a 3-year period and to make interim
assembled sequence data available quarterly.
The modifications included a plan to perform
random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies
published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly
announcements in the absence of interim assemblies
to report.

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes
of the Homo sapiens genome. Any
GenBank-derived data were shredded to remove
potential bias to the final sequence from chi- 
meric clones, foreign DNA contamination, or 
misassembled contigs. Insofar as a correctly 
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/ 
5507/1304/DC1) provides a graphical over- 
view of the genome and the features encoded 
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing
Methods

Summary. This section discusses the rationale
and ethical rules governing donor selection to
ensure ethnic and gender diversity along with
the methodologies for DNA extraction and library
construction. The plasmid library construction
is the first critical step in shotgun
sequencing. If the DNA libraries are not uniform
in size and nonchimeric, and do not randomly
represent the genome, then the subsequent steps
cannot accurately reconstruct the genome sequence.
We used automated high-throughput
DNA sequencing and the computational infrastructure
to enable efficient tracking of enormous
amounts of sequence information (27.3
million sequence reads; 14.9 billion bp of se- 
quence). Sequencing and tracking from both 
ends of plasmid clones from 2-, 10-, and 50-kbp 
libraries were essential to the computational 
reconstruction of the genome. Our evidence 
indicates that the accurate pairing rate of end 
sequences was greater than 98%. 

Various policies of the United States and the
World Medical Association, specifically the
Declaration of Helsinki, offer recommendations
for conducting experiments with human
subjects. We convened an Institutional Review
Board (IRB) (31) that helped us establish
the protocol for obtaining and using human
DNA and the informed consent process
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act, 42 U.S.C. 241(d).

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds.
Prospective donors were asked, on a voluntary
basis, to self-designate an ethnogeographic
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by confidential
code to the donated sample: age,
sex, and self-designated ethnogeographic
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by
Epstein-Barr virus immortalization. DNA
from five subjects was selected for genomic
DNA sequencing: two males and three females:
one African-American, one Asian-
Chinese, one Hispanic-Mexican, and two
Caucasians (see Web fig. 2 on Science Online
at www.sciencemag.org/cgi/content/291/5507/
1304/DC1). The decision of whose DNA to
sequence was based on a complex mix of factors,
including the goal of achieving diversity as
well as technical issues such as the quality of
the DNA libraries and availability of immortalized
cell lines.

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid librar- 
ies in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored
effectively (Fig. 2) (34).

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence
per reaction. This limitation on read length has
made monumental gains in throughput a prerequisite
for the analysis of large eukaryotic
genomes. We accomplished this at the Celera
facility, which occupies about 30,000 square
feet of laboratory space and produces sequence
data continuously at a rate of 175,000 total
reads per day. The DNA-sequencing facility is
supported by a high-performance computational
facility (36).

The process for DNA sequencing was modular
by design and automated. Intermodule
sample backlogs allowed four principal
modules to operate independently: (i) library
transformation, plating, and colony
picking; (ii) DNA template preparation;
(iii) dideoxy sequencing reaction set-up
and purification; and (iv) sequence determination
with the ABI PRISM 3700 DNA
Analyzer. Because the inputs and outputs
of each module have been carefully
matched and sample backlogs are continuously
managed, sequencing has proceeded
without a single day's interruption since the
initiation of the Drosophila project in May
1999. The ABI 3700 is a fully automated
capillary array sequencer and as such can
be operated with a minimal amount of
hands-on time, currently estimated at about
15 min per day. The capillary system also
facilitates correct associations of sequencing
traces with samples through the elimination
of manual sample loading and lane-
tracking errors associated with slab gels.
About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management
system (LIMS) tracked all sample plates by
unique bar code identifiers. The facility was
supported by a quality control team that performed
raw material and in-process testing
and a quality assurance group with responsibilities
including document control, validation,
and auditing of the facility. Critical to
the success of the scale-up was the validation
of all software and instrumentation before
implementation, and production-scale testing
of any process changes.

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the 
average trimmed sequence length was 543 
bp, and the sequencing accuracy was exponentially
distributed with a mean of 99.5%
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 
The importance of the base-pair level accuracy
of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-
pair information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as sequencing
reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data produced
by the Celera process was validated
in the course of the Drosophila genome
project (25). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1. Celera-generated data input into assembly.
*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing t

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the genome.
One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to computational
assembly. Both approaches provided
essentially the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the genome
was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.



Shotgun sequence assembly is a classic
example of an inverse problem: given a set
of reads randomly sampled from a target
sequence, reconstruct the order and the position
of those reads in the target. Genome
assembly algorithms developed for Drosophila
have now been extended to assemble
the ~25-fold larger human genome. Celera
assemblies consist of a set of contigs that are
ordered and oriented into scaffolds that are then
mapped to chromosomal locations by using
known markers. The contigs consist of a collection
of overlapping sequence reads that provide
a consensus reconstruction for a contiguous
interval of the genome. Mate pairs are a
central component of the assembly strategy.
They are used to produce scaffolds in which the
size of gaps between consecutive contigs is
known with reasonable precision. This is accomplished
by observing that a pair of reads,
one of which is in one contig, and the other of
which is in another, implies an orientation and
distance between the two contigs (Fig. 3). Finally,
our assemblies did not incorporate all
reads into the final set of reported scaffolds.
This set of unincorporated reads is termed
"chaff," and typically consisted of reads from
within highly repetitive regions, data from other
organisms introduced through various routes as
found in many genome projects, and data of
poor quality or with untrimmed vector.
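
The way a mate pair spanning two contigs implies a gap size can be illustrated with a short calculation. This is a minimal sketch under simplifying assumptions, not the assembler described in the paper: the function names, the coordinate convention (0-based offsets from the left end of each contig, forward read in the left contig, mate ending in the right contig) and the example numbers are all hypothetical.

```python
# Minimal sketch (hypothetical names and numbers): estimating the gap between
# two adjacent contigs from mate pairs drawn from a library whose mean insert
# size is known. Assumes the left contig holds the forward read and the right
# contig holds its mate; both offsets are measured from each contig's left end.

def gap_estimate(left_contig_len, read_start_in_left, mate_end_in_right, insert_mean):
    """Gap implied by one mate pair: insert length minus the parts inside contigs."""
    covered_in_left = left_contig_len - read_start_in_left   # read start to contig end
    covered_in_right = mate_end_in_right                      # contig start to mate end
    return insert_mean - covered_in_left - covered_in_right


def scaffold_gap(left_contig_len, mate_pairs, insert_mean):
    """Average the per-pair estimates; more spanning pairs give a tighter estimate."""
    estimates = [
        gap_estimate(left_contig_len, start, end, insert_mean)
        for start, end in mate_pairs
    ]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Two hypothetical mate pairs from a 2-kbp library spanning the same gap.
    pairs = [(9000, 500), (9100, 620)]
    print(scaffold_gap(10000, pairs, insert_mean=2000))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap; a real assembler would also weigh each estimate by the library's insert-size standard deviation, which this sketch ignores.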



THE HUMAN GENOME 
2.1 Assembly data sets 

We used two independent sets of data for our
assemblies. The first was a random shotgun
data set of 27.27 million reads of average length
543 bp produced at Celera. This consisted
largely of mate-pair reads from 16 libraries
constructed from DNA samples taken from five
different donors. Libraries with insert sizes of 2,
10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in
known sequenced stretches of the genome, we
were able to characterize the range of insert
sizes in each library and determine a mean and
standard deviation. Table 1 details the number
of reads, sequencing coverage, and clone coverage
achieved by the data set. The clone coverage
is the coverage of the genome in cloned
DNA, considering the entire insert of each
clone that has sequence from both ends. The
clone coverage provides a measure of the
amount of physical DNA coverage of the genome.
Assuming a genome size of 2.9 Gbp, the
Celera trimmed sequences gave a 5.1X coverage
of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.
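
The coverage figures quoted above follow from simple arithmetic, sketched here for checking. The function and variable names are illustrative assumptions; the numbers are the ones given in the text (27.27 million reads, 543 bp mean trimmed length, 2.9 Gbp assumed genome size, and the three per-library clone coverages).

```python
# Illustrative arithmetic check of the coverage figures quoted in the text.
# Names are hypothetical; the input values are taken from the paragraph above.

def sequence_coverage(num_reads, mean_read_len_bp, genome_size_bp):
    """Fold coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_len_bp / genome_size_bp


def total_clone_coverage(per_library_coverage):
    """Total clone coverage is the sum over libraries."""
    return sum(per_library_coverage.values())


GENOME_SIZE_BP = 2.9e9
CLONE_COVERAGE = {"2 kbp": 3.42, "10 kbp": 16.40, "50 kbp": 18.84}

if __name__ == "__main__":
    seq_cov = sequence_coverage(27.27e6, 543, GENOME_SIZE_BP)
    print(f"sequence coverage: {seq_cov:.1f}X")
    print(f"clone coverage: {total_clone_coverage(CLONE_COVERAGE):.1f}X")
```

Clone coverage is much higher than sequence coverage because each mate pair counts the whole insert between the two reads, not just the roughly 543 bp sequenced at each end.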

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones {30), The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered assemblies
of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC 
sequences. In the past 2 years the PFP has
focused on a product of lower quality and com- 
pleteness, but on a faster time-course, by con- 
centrating on the production of Phase 1 data 
from a 3X to 4X light-shotgun of each BAC 
clone. 

We screened the bactig sequences for contaminants
by using the BLAST algorithm
against three data sets: (i) vector sequences 
in Univec core (55), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank without
primate and human virus entries, filtered at
200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and in- 
cluded in the data sets for both assembly 
processes (18).
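
The tip-excision rule described above (25 bp or more of vector within 50 bp of a contig end) can be sketched as follows. This is a minimal illustration, not the screening code actually used: the function name and the hit representation are hypothetical, and we assume that the matching vector is removed together with the tip.

```python
def excise_vector_tips(contig_len, vector_hits, min_hit=25, end_window=50):
    """Return the retained (start, end) interval of a contig, 0-based half-open.

    vector_hits is a list of (hit_start, hit_end) vector matches on the contig.
    A hit of at least min_hit bp lying within end_window bp of a contig end
    causes that tip, and (by assumption) the vector itself, to be removed.
    """
    start, end = 0, contig_len
    for hit_start, hit_end in vector_hits:
        if hit_end - hit_start < min_hit:
            continue  # too short to count as vector
        if hit_start <= end_window:
            start = max(start, hit_end)      # trim the left tip
        if contig_len - hit_end <= end_window:
            end = min(end, hit_start)        # trim the right tip
    return start, max(start, end)

print(excise_vector_tips(1000, [(10, 40), (960, 990)]))
```

Internal vector matches are left untouched here; the text filters those separately with the 30-bp internal criterion.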

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 2.2% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional se- 
quence coverage, but not mate pairs, assembled 
bactigs, or genome locality, from some exter- 
nally generated data. 
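
The shredding step can be illustrated with a short sketch. The text specifies only 550-bp faux reads forming a 2X covering. The half-read stride and the handling of the final read are our assumptions, and with this simple scheme the first and last 275 bp of a bactig are covered only once.

```python
def shred(bactig, read_len=550, coverage=2):
    """Cut a bactig into overlapping faux reads of read_len bp.

    Reads start every read_len // coverage bp, so interior bases are
    covered `coverage` times. A final read is placed flush with the
    end of the bactig so that no bases are lost (an assumption).
    """
    n = len(bactig)
    if n <= read_len:
        return [bactig]
    step = read_len // coverage
    starts = list(range(0, n - read_len + 1, step))
    if starts[-1] != n - read_len:
        starts.append(n - read_len)
    return [bactig[s:s + read_len] for s in starts]

reads = shred("ACGT" * 500)   # a 2000-bp toy bactig
print(len(reads), len(reads[0]))
```

Because the faux reads carry no record of their original order, the assembler is free to reassemble them differently from the submitted bactig, which is the point made in the text.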

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments
or "components" that could be determined with 
confidence, and then shotgun assembly was ap- 
plied to each partitioned subset wherein the 
bactig data were again shredded into faux reads 
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was reduced
and the effect of interchromosomal duplications
was ameliorated. This also resulted in a
reconstruction of the genome that was relatively
independent of the whole-genome assembly results
so that the two assemblies could be compared
for consistency. The quality of the partitioning
into components was crucial so that
different genome regions were not mixed together.
We constructed components from (i) the
longest scaffolds of the sequence from each 
BAC and (ii) assembled scaffolds of data unique 
to Celera's data set. The BAC assemblies were 
obtained by a combining assembler that used the 
bactigs and the 5X Celera data mapped to those 
bactigs as input. This effort was undertaken as 
an interim step solely because the more accurate 
and complete the scaffold for a given sequence 
stretch, the more accurately one can tile these 
scaffolds into contiguous components on the
basis of sequence overlap and mate-pair information.
We further visually inspected and curated
the scaffold tiling of the components to
further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of
the sequence in each component was obtained
by applying our whole-genome assembly algorithm
to the partitioned, relevant Celera data and the
shredded, faux reads of the partitioned, relevant
bactig data.

2.3 Whole-genome assembly
The algorithms used for whole-genome assembly
(WGA) of the human genome were
enhancements to those used to produce the
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements, including
Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.
The Overlapper compares every read
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously vector-trimmed,
the Overlapper can insist on complete
overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours 
with a suite of four-processor Alpha SMPs 
with 4 gigabytes of RAM. This took 4 to 5 
days in elapsed time with 40 such machines 
operating in parallel.
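
The overlap criterion (a complete end-to-end match of at least 40 bp with no more than 6% differences) can be sketched as below. This is a deliberately naive illustration with a hypothetical function name: it counts substitutions only and tries every overlap length, whereas a production overlapper must also tolerate insertions and deletions and cannot afford an all-against-all scan of this kind.

```python
def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of read a that matches a prefix of
    read b with at most max_diff mismatch rate, or 0 if none reaches
    min_len. Substitution-only comparison (an assumption)."""
    for olap in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-olap:], b[:olap]))
        if mismatches <= max_diff * olap:
            return olap
    return 0

shared = "G" * 25 + "T" * 25
left = "A" * 60 + shared      # read ending in the shared 50 bp
right = shared + "C" * 60     # read starting with the shared 50 bp
print(end_overlap(left, right, max_diff=0.0))
```

With `max_diff=0.0` the toy reads overlap by exactly the 50 shared bases; unrelated reads return 0.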

Every overlap computed above is statistically
a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially
difficult is that while many overlaps
are actually sampled from overlapping
regions of the genome, and thus imply that
the sequence reads should be assembled together,
even more overlaps are actually from
two distinct copies of a low-copy repeated 
element not screened above, thus constituting 
an error if put together. We call the former 
"true overlaps" and the latter "repeat-induced 
overlaps." The assembler must avoid choos- 
ing repeat-induced overlaps, especially early 
in the process. 

We achieve this objective in the Unitig- 
ger. We first find all assemblies of reads that 
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely assembled
contigs). Formally, these unitigs are
the uncontested interval subgraphs of the 
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6X simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all 
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
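
The text does not give the formula for the discriminator, so the following is only one way such a log-odds score could be derived. Assume read start positions arrive as a Poisson process. A unique unitig of length L then expects lam = L * F / G read starts (F reads in total, genome size G), and a collapsed two-copy repeat expects 2 * lam. The log of the likelihood ratio for observing k reads reduces to lam - k * ln 2. All names below are ours.

```python
import math

def unique_vs_repeat_log_odds(unitig_len_bp, n_reads, total_reads, genome_size_bp):
    """Natural-log odds that a unitig is unique rather than a collapsed
    two-copy repeat, under a Poisson model of read starts (an assumption).

    ln[P(k | lam) / P(k | 2 * lam)] = lam - k * ln 2
    """
    lam = unitig_len_bp * total_reads / genome_size_bp
    return lam - n_reads * math.log(2)

# A 10-kbp unitig drawn from 43.32 million reads over a 2.9-Gbp genome.
print(unique_vs_repeat_log_odds(10_000, 150, 43.32e6, 2.9e9) > 0)   # depth as expected
print(unique_vs_repeat_log_odds(10_000, 300, 43.32e6, 2.9e9) > 0)   # twice the expected depth
```

A positive score favors unique DNA and a negative score favors a repeat; the two thresholds mentioned in the text would then be cutoffs on a score of this kind.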

The result of running the Unitigger was 
thus a set of correctly assembled subcontigs 
covering an estimated 73.6% of the human 
genome. The Scaffolder then proceeded to 
use mate-pair information to link these together
into scaffolds. When there are two or
more mate pairs that imply that a given pair
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one 
can with high confidence link together all 
U-unitigs that are linked by at least two 2- or 
10-kbp mate pairs producing intermediate- 
sized scaffolds that are then recursively 
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process 
yielded scaffolds that are on the order of 
megabase pairs in size with gaps between 
their contigs that generally correspond to re- 
petitive elements and occasionally to small 
sequencing gaps. These scaffolds reconstruct 
the majority of the unique sequence within a 
genome. 
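
The two-mate-pair rule can be sketched as a bundling step. The data layout, the function name, and the 1-kbp agreement tolerance are assumptions made for illustration; the text states only that at least two mate pairs must imply the same distance and orientation.

```python
from collections import defaultdict

def confident_links(mate_links, min_support=2, tolerance_bp=1000):
    """Keep unitig pairs supported by at least min_support mate pairs
    that agree on orientation and, within tolerance_bp, on distance.

    mate_links: iterable of (unitig_a, unitig_b, orientation, distance_bp).
    Returns {(unitig_a, unitig_b, orientation): mean agreed distance}.
    """
    bundles = defaultdict(list)
    for a, b, orient, dist in mate_links:
        bundles[(a, b, orient)].append(dist)
    kept = {}
    for key, dists in bundles.items():
        dists.sort()
        median = dists[len(dists) // 2]
        agreeing = [d for d in dists if abs(d - median) <= tolerance_bp]
        if len(agreeing) >= min_support:
            kept[key] = sum(agreeing) / len(agreeing)
    return kept

links = [
    ("u1", "u2", "+", 1500), ("u1", "u2", "+", 1700),   # two agreeing mates
    ("u2", "u3", "+", 900),                             # single mate: not trusted
    ("u1", "u3", "-", 5000), ("u1", "u3", "-", 50000),  # mates disagree on distance
]
print(confident_links(links))
```

Only the u1-u2 link survives in this toy input.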

For the Drosophila assembly, we engaged 
in a three-stage repeat resolution strategy 
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we contin- 
ued to use the first "Rocks" substage where 
all unitigs with a good, but not definitive, 
discriminator score are placed in a scaffold 
gap. This was done with the condition that 
two or more mate pairs with one of their 
reads already in the scaffold unambiguously 
place the unitig in the given gap. We estimate 
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage 
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the 
gap, eliminating any reads that conflict with
the assembly. This operation proved much 
more reliable than the one it replaced for the 
Drosophila assembly; in the assembly of a 
simulated shotgun data set of human chromosome
22, all stones were placed correctly.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long interspersed
elements whose quality was only
99.62% correct. We decided that for the hu- 
man genome it was philosophically better not 
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process,
and also at several intermediate points, a 
consensus sequence of every contig is pro- 
duced. Our algorithm is driven by the princi- 
ple of maximum parsimony, with quality- 
value-weighted measures for evaluating each 
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
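
As an illustration of a quality-value-weighted base call, the sketch below computes a posterior over the four bases for one alignment column. It is not the consensus algorithm described above. It assumes Phred-scaled quality values (error probability 10^(-Q/10)), independent errors spread evenly over the three other bases, and a uniform prior, and it does not model the preference for Celera data over BAC data.

```python
import math

def call_base(observations):
    """Most probable base for one column and its posterior probability.

    observations: list of (base, phred_quality) pairs from the reads
    covering the column.
    """
    log_like = {}
    for cand in "ACGT":
        total = 0.0
        for base, q in observations:
            # Cap the error rate at 0.75, where a read carries no information.
            p_err = min(10 ** (-q / 10), 0.75)
            total += math.log(1 - p_err) if base == cand else math.log(p_err / 3)
        log_like[cand] = total
    best = max(log_like, key=log_like.get)
    norm = sum(math.exp(v - log_like[best]) for v in log_like.values())
    return best, 1.0 / norm

print(call_base([("A", 30), ("A", 20), ("C", 10)]))
```

A single high-quality read can outvote a low-quality one, which is the practical effect of weighting by quality value.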

A key element of achieving a WGA of the 
human genome was to parallelize the Overlap- 
per and the central consensus sequence-con- 
structing subroutines. In addition, memory was 
a real issue: a straightforward application of
the software we had built for Drosophila would
have required a computer with 600 gigabytes of
RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same 
computation with a maximum of instantaneous 
usage of 28 gigabytes of RAM. Moreover, the 
incremental nature of the first three stages al- 
lowed us to continually update the state of this 
part of the computation as data were delivered
and then perform a 7-day run to complete Scaf- 
folding and Repeat Resolution whenever de- 
sired. For our assembly operations, the total 
compute infrastructure consists of 10 four-processor
SMPs with 4 gigabytes of memory per
cluster (Compaq's ES40, Regatta) and a 16-processor
NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The 
total compute for a run of the assembler was 
roughly 20,000 CPU hours. 

The assembly of Celera's data, together 
with the shredded bactig data, produced a set of 
scaffolds totaling 2.848 Gbp in span and con- 
sisting of 2.586 Gbp of sequence. The chaff, or 
set of reads not incorporated in the assembly, 
numbered 11.27 million (26%), which is con- 
sistent with our experience for Drosophila. 
More than 84% of the genome was covered by 
scaffolds >100 kbp long, and these averaged 
91% sequence and 9% gaps with a total of 
2.297 Gbp of sequence. There were a total of 
93,857 gaps among the 1637 scaffolds >100 
kbp. The average scaffold size was 1.5 Mbp, 
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the distribution
of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long. Similarly,
more than 65% of the sequence is in contigs
>30 kbp, more than 31% is in contigs >100 
kbp, and the largest contig was 1.22 Mbp long. 
Table 3 gives detailed summary statistics for 
the structure of this assembly with a direct 
comparison to the compartmentalized shotgun 
assembly.

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pur- 
sued a localized assembly approach that was 
intended to subdivide the genome into seg- 
ments, each of which could be shotgun as- 
sembled individually. We expected that this 
would help in resolution of large interchro- 
mosomal duplications and improve the statis- 
tics for calculating U-unitigs. The compart- 
mentalized assembly process involved clus- 
tering Celera reads and bactigs into large, 
multiple-megabase regions of the genome,
and then running the WGA assembler on the 
Celera data and shredded, faux reads ob- 
tained from the bactig data. 

The first phase of the CSA strategy was to 
separate Celera reads into those that matched 
the BAC contigs for a particular PFP BAC 
entry, and those that did not match any public 
data. Such matches must be guaranteed to 
properly place a Celera read, so all reads were
first masked against a library of common 
repetitive elements, and only matches of at 
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not 
have any matches, were nonetheless identi- 
fied as belonging in the region of the bactig's 
BAC because their mate matched the bactig. 
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera 
sequence is not in the GenBank data set. 

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there 
are excessive pileups indicative of un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions
are removed. Then all sets of mate pairs that
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
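
The greedy bundle-selection step can be sketched as follows. This is a simplified, one-dimensional illustration with hypothetical names: orientation is ignored, each bundle is reduced to a single implied offset between two bactigs, and the consistency test is applied only to bundles whose bactigs already share a scaffold, rather than the majority-of-links test described above.

```python
def greedy_scaffold(bundles, tolerance_bp=500):
    """Order bactigs by taking mate-pair bundles from heaviest to lightest.

    bundles: list of (weight, a, b, offset_bp), where the bundle implies
    start(b) - start(a) is approximately offset_bp.
    Returns (list of {bactig: position} scaffolds, rejected (a, b) pairs).
    """
    scaffold_of = {}   # bactig -> scaffold id
    layout = {}        # scaffold id -> {bactig: position}
    rejected = []
    for weight, a, b, offset in sorted(bundles, key=lambda t: t[0], reverse=True):
        for x in (a, b):
            if x not in scaffold_of:
                scaffold_of[x] = x
                layout[x] = {x: 0}
        sa, sb = scaffold_of[a], scaffold_of[b]
        if sa == sb:
            # Both already placed: accept only if consistent with the layout.
            implied = layout[sa][b] - layout[sa][a]
            if abs(implied - offset) > tolerance_bp:
                rejected.append((a, b))
            continue
        # Merge scaffold sb into sa, shifting it so the bundle is satisfied.
        shift = layout[sa][a] + offset - layout[sb][b]
        for x, pos in layout.pop(sb).items():
            layout[sa][x] = pos + shift
            scaffold_of[x] = sa
    return list(layout.values()), rejected

bundles = [(5, "b1", "b2", 10000), (3, "b2", "b3", 8000), (1, "b1", "b3", 2000)]
print(greedy_scaffold(bundles))
```

The lightest bundle contradicts the layout built from the two heavier ones and is rejected, which shows why bundles are taken in order of weight.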

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase 
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler 
was also applied to the Phase 3 BACs for 
SNP identification, confirmation of assem- 
bly, and localization of the Celera reads. The 
Phase 0 data suggest that a combined whole-genome
shotgun data set and 1X light-shotgun
of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a 
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome com- 
ponents was to determine the order and over- 
lap tiling of these BAC and Celera-unique 
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pair information,
BAC-end pairs (18), and sequence tagged
site (STS) markers (44) to provide long-range
guidance and chromosome separation.
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



Chimeric or contaminating sequence (from 
another part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process 
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 

WGA assembly of the components resulted
in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of sequence.
The chaff, or set of reads not incorporated
into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100 
kbp. The average scaffold size was 1.4 Mbp, 
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As 
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than 
49% of all gaps were <500 bp long, more 
than 62% of all gaps were <1 kbp, and all 
gaps are <100 kbp long. Similarly, more than 
73% of the sequence is in contigs > 30 kbp, 
more than 49% is in contigs > 100 kbp, and 
the largest contig was 1.99 Mbp long. Table 3 
provides summary statistics for the structure 
of this assembly with a direct comparison to 
the WGA assembly. 



2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and
2.474 Gbp, respectively. The sequence of each reference
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
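
The rule for choosing which scaffold pairs to compare can be written as a small predicate. The function name and the use of fragment-identifier sets are assumptions made for illustration.

```python
def should_compare(ref_frags, other_frags, min_shared=20, min_fraction=0.20):
    """True if two scaffolds share at least min_shared fragments, or at
    least min_fraction of the fragments of the smaller scaffold.

    ref_frags, other_frags: sets of fragment (read or bactig shred) IDs.
    """
    shared = len(ref_frags & other_frags)
    smaller = min(len(ref_frags), len(other_frags))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)

reference = set(range(1000))
print(should_compare(reference, set(range(990, 1040))))   # 10 shared of 50
print(should_compare(reference, set(range(995, 1100))))   # 5 shared of 105
```

The fractional test lets small scaffolds qualify even when they cannot reach 20 shared fragments.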

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were 
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA 
(3.95%) was not covered by the CSA, where- 
as 204.5 Mbp (8.26%) of the CSA was not 
covered by the WGA. This estimate did not 
require any consistency of the assemblies or 
any uniqueness of the matching segments. 
Thus, another analysis was conducted in 
which matches of less than 1 kbp between a 
pair of scaffolds were excluded unless they 
were confirmed by other matches having a
consistent order and orientation. This gives
some measure of consistent coverage: 1.982
Gbp (95.00%) of the WGA is covered by the 
CSA, and 2.169 Gbp (87.69%) of the CSA is 
covered by the WGA by this more stringent 
measure.

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candi- 
dates was identified automatically, and then 
each candidate was inspected by hand. From 
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being fur- 
ther evaluated to determine which assembly 
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in 
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely >1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp 
(0.11%) in the WGA assembly were incon- 
sistent with the CSA assembly. 
The CSA assembly was a few percentage
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the in- 
formation loss between the two is remarkably 
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data between
the scaffolds. We next mapped the scaffold
groups onto the chromosome using physical
mapping data. This step depends on having
reliable high-resolution map information such 
that each scaffold will overlap multiple mark- 
ers. There are two genome-wide types of map 
information available: high-density STS maps 
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the genome-wide
STS maps, GeneMap99 (GM99)
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have bet- 
ter local order because they were built by com- 
parison of overlapping BAC clones. On the 
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a 
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler. 

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total 
sequence is indicated. 



In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaf- 
folds. Only 1% of the STS markers on the 10 
largest scaffolds (those >9 Mbp) were 
mapped on a different chromosome on 
GM99. Two percent of the STS markers dis- 
agreed in position by more than five frame- 
work bins. However, for the fingerprint 
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed 
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low discrep- 
ancy rate to GM99, and thus we concluded 
that the fingerprint map global order in these 
cases was not reliable. Smaller scaffolds had 
a higher discordance rate with GM99 (4.21% 
of STSs were discordant by more than five 
framework bins), but a lower discordance rate 
with the fingerprint maps (11% of BACs 
disagreed with fingerprint maps by more than 
five BACs). This observation agrees with the 
clone coverage analysis (46) that Celera scaf- 
fold construction was better supported by 
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with
consistent order. Scaffolds with only one
marker have insufficient information to assign
orientation. We found 70.1% of the genome
in anchored scaffolds, more than 99%
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could
be ordered relative to the anchored scaffolds
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of 
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
^^was ordered unambiguously. 
^Plxt, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bounded"
between them. For example, small scaffolds
having STS hits from the same GeneMap bin
or hitting the same BAC cannot be
ordered relative to each other, but can be 
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above
approaches, ~98% of the genome was
anchored, ordered, or bounded.

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the total number of
mapped scaffolds, we arrived at an estimate 
of interscaffold gap of 1483 bp. This gap was 
used to separate all the scaffolds on each 
chromosome and to assign an offset in the 

During the scaffold-mapping effort, we
encountered many problems that resulted in
additional quality assessment and validation
analysis. At least 978 (3% of 33,173) BACs were
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not using
this data was that doing so would have
biased certain regions of the assembly by
rearranging scaffolds to fit the transcript data
and made validation of both the assembly and
gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the
euchromatin sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this
sequence served as input to the assembler, the
finished sequence was shredded into a shot- 
gun data set so that the assembler had the 
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must 
be able to resolve repetitive elements at the 
scale of components (generally multimega- 
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 

A more global way of assessing completeness
is to measure the content of an independent
set of sequence data in the assembly. We
compared 48,938 STS markers from GeneMap99
(51) to the scaffolds. Because these markers
were not used in the assembly processes, they
provided a truly independent measure of
completeness. ePCR (53) and BLAST (54) were
used to locate STSs on the assembled genome.
We found 44,524 (91%) of the STSs in the
mapped genome. An additional 2648 markers
(5.4%) were found by searching the
unassembled data, or "chaff." We identified 1283
STS markers (2.6%) not found in either Celera
sequence or BAC data as of September 2000,
raising the possibility that these markers may
not be of human origin. If that were the case,
the Celera assembled sequence would represent
93.4% of the human genome and the
unassembled data 5.5%, for a total of 98.9%
coverage. Similarly, we compared CSA against
36,678 TNG radiation hybrid markers (55a)
using the same method. We found that 32,371
markers (88%) were located in the mapped
CSA scaffolds, with 2055 markers (5.6%)
found in the remainder. This gave a 94%
coverage of the genome through another
genome-wide survey.
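
The percentages quoted above can be re-derived from the marker counts. The following is a minimal Python sketch of that arithmetic; the helper `percent` and all variable names are illustrative assumptions, not part of the original analysis.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# GeneMap99 STS counts quoted in the text
total_sts = 48_938
in_mapped = 44_524   # found in the mapped genome
in_chaff = 2_648     # found only in unassembled data ("chaff")
not_found = 1_283    # found in neither Celera sequence nor BAC data

mapped_pct = percent(in_mapped, total_sts)
chaff_pct = percent(in_chaff, total_sts)
missing_pct = percent(not_found, total_sts)

# If the markers found nowhere are excluded as possibly non-human,
# the denominator shrinks to the remaining markers.
human_sts = total_sts - not_found
assembled_pct = percent(in_mapped, human_sts)
combined_pct = percent(in_mapped + in_chaff, human_sts)

# TNG radiation hybrid markers: mapped scaffolds plus the remainder
tng_pct = percent(32_371 + 2_055, 36_678)

print(f"GeneMap99: {mapped_pct:.1f}% mapped, {chaff_pct:.1f}% chaff, "
      f"{missing_pct:.1f}% not found")
print(f"Excluding unfound markers: {assembled_pct:.1f}% assembled, "
      f"{combined_pct:.1f}% combined")
print(f"TNG coverage: {tng_pct:.1f}%")
```

Under these assumptions the recomputed values agree with the quoted figures to within about 0.1 percentage point; the small differences presumably reflect rounding of intermediate values.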

Correctness. Correctness is defined as the
structural and sequence accuracy of the
assembly. Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the
assembly against other finished sequence for
determining sequencing accuracy at the
nucleotide level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaffolds
were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had
order conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold.
Unmapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

The structural consistency of the assembly 
can be measured by mate-pair analysis. In a 
correct assembly, every mated pair of se- 
quencing reads should be located on the con- 
sensus sequence with the correct separation 
and orientation between the pairs. A pair is
termed "valid" when the reads are in the
correct orientation, and the distance between
them is within the mean ± 3 standard
deviations of the distribution of insert sizes of the
library from which the pair was sampled. A
pair is termed "misoriented" when the reads
are not correctly oriented, and is termed
"misseparated" when the distance between the
reads is not in the correct range but the reads
are correctly oriented. The mean ± the
standard deviation of each library used by the
assembler was determined as described
above. To validate these, we examined all
reads mapped to the finished sequence of
chromosome 21 (48) and determined how
many incorrect mate pairs there were as a
result of laboratory tracking errors and
chimerism (two different segments of the
genome cloned into the same plasmid), and how
tight the distribution of insert sizes was for
those that were correct (Table 5). The
standard deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries
contained less than 2% invalid mate pairs,
whereas the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair
information was not perfect, its accuracy was such
that measuring valid, misoriented, and
misseparated pairs with respect to a given
assembly was deemed to be a reliable instrument
for validation purposes, especially when
several mate pairs confirm or deny an ordering.
The clone coverage of the genome was
39X, meaning that any given base pair was,
on average, contained in 39 clones or,
equivalently, spanned by 39 mate-paired reads.
Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would
indicate potential assembly problems. We
computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.
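
The three mate-pair categories defined above can be expressed as a small classifier. This is a minimal Python sketch, not the assembler's actual code: the `Library` class, the function name, and the choice to pass orientation in as a precomputed boolean are all assumptions made for illustration. The example values are those listed for library 1 in the chromosome 21 columns of Table 5.

```python
from dataclasses import dataclass

@dataclass
class Library:
    mean: float  # mean insert size (bp)
    sd: float    # standard deviation of insert size (bp)

def classify_mate_pair(correctly_oriented, separation, library, n_sd=3.0):
    """Classify one mate pair as 'valid', 'misoriented', or 'misseparated'.

    correctly_oriented: whether the two reads have the expected relative
        orientation (assumed to be determined upstream).
    separation: observed distance between the mates on the consensus (bp).
    """
    if not correctly_oriented:
        return "misoriented"
    low = library.mean - n_sd * library.sd
    high = library.mean + n_sd * library.sd
    if low <= separation <= high:
        return "valid"
    # Correct orientation, but distance outside mean ± n_sd standard deviations
    return "misseparated"

# Library 1: mean 2,081 bp, SD 106 bp, so the valid range is 1,763 to 2,399 bp
lib = Library(mean=2081, sd=106)
results = [
    classify_mate_pair(True, 2200, lib),
    classify_mate_pair(True, 2500, lib),
    classify_mate_pair(False, 2081, lib),
]
print(results)
```

Orientation is tested first because, by the definition in the text, a pair is called misseparated only when it is already correctly oriented.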

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to
the PFP assembly. To avoid mapping errors
due to high-fidelity repeats, the only pairs
mapped were those for which both reads
matched at only one location with less than
6% differences. A threshold was set such that
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint,
where the construction of the two assemblies
differed. The graphic comparison of the CSA
chromosome 21 assembly with the published
sequence (Fig. 6A) serves as a validation of
this methodology. Blue tick marks in the
panels indicate breakpoints. There were a
similar (small) number of breakpoints on
both chromosome sequences. The exception
was 12 sets of scaffolds in the Celera
assembly (a total of 3% of the chromosome length
in 212 single-contig scaffolds) that were
mapped to the wrong positions because they
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two
assemblies. There was a higher percentage of
misoriented and misseparated mate pairs in
the large-insert libraries (50 kbp and BAC
ends) than in the small-insert libraries in both
assemblies (Table 6). The large-insert
libraries are more likely to identify discrepancies
simply because they span a larger segment of
the genome. The graphic comparison between
the two assemblies for chromosome 8
(Fig. 6, B and C) shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative
orientation or placement, they were considered invalid (number of invalid mate
pairs).

                  Chromosome 21                                                    Genome
Library  Library  Mean insert  SD      SD/mean  No. of mate   No. of invalid  %             Mean insert  SD      SD/mean
type     no.      size (bp)    (bp)    (%)      pairs tested  mate pairs      invalid       size (bp)    (bp)    (%)

2 kbp    1        2,081        106     5.1      3,642         38              1.0           2,082        90      4.3
         2        1,913        152     7.9      28,029        413             1.5           1,923        118     6.1
         3        2,166        175     8.1      4,405         57              1.3           2,162        158     7.3
10 kbp   4        11,385       851     7.5      4,319         80              1.9           11,370       696     6.1
         5        14,523       1,875   12.9     7,355         156             2.1           14,142       1,402   9.9
         6        9,635        1,035   10.7     5,573         109             2.0           9,606        934     9.7
         7        10,223       928     9.1      34,079        399             1.2           10,190       777     7.6
50 kbp   8        64,888       2,747   4.2      16            1               6.3           65,500       5,504   8.4
         9        53,410       5,834   10.9     914           170             18.6          53,311       5,546   10.4
         10       52,034       7,312   14.1     5,871         569             9.7           51,498       6,588   12.8
         11       52,282       7,454   14.3     2,629         213             8.1           52,282       7,454   14.3
         12       46,616       7,378   15.8     2,153         215             10.0          45,418       9,068   20.0
         13       55,788       10,099  18.1     2,244         249             11.1          53,062       10,893  20.5
         14       39,894       5,019   12.6     199           7               3.5           36,838       9,988   27.1
         15       48,931       9,813   20.1     144           10              6.9           47,845       4,774   10.0
         16       48,130       4,232   8.8      195           14              7.2           47,924       4,581   9.6
BES      17       106,027      27,778  26.2     330           16              4.8           152,000      26,600  17.5
         18       160,575      54,973  34.2     155           8               5.2           161,750      27,000  16.7
         19       164,155      19,453  11.9     642           44              6.9           176,500      19,500  11.05
Sum                                             102,894       2,768           (mean = 2.7)

gene boundaries. During this process, multiple
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins" each of which was be- 
lieved to contain a single gene. One weakness of 
this initial implementation of the algorithm was 
in predicting gene boundaries in regions of
tandemly duplicated genes. Gene clusters
frequently resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models.
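
Collapsing multiple overlapping hits into the union of the regions they cover is an interval-merging problem. The sketch below shows one common way to do it in Python; it is an illustration under assumptions (half-open `(start, end)` coordinates, a hypothetical `merge_intervals` helper), not the pipeline's own implementation.

```python
def merge_intervals(intervals):
    """Return the union of half-open (start, end) intervals, sorted by start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous region: extend that region
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical EST matches on one scaffold (coordinates are made up)
est_hits = [(100, 250), (200, 400), (900, 1000), (380, 450)]
covered = merge_intervals(est_hits)
print(covered)
```

Each merged region is then the stretch of scaffold marked as supported by EST evidence. Because merging works purely on proximity, it also shows how adjacent homologous genes could end up joined into one region, which is the weakness noted above for tandemly duplicated genes.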

Next, known genes (those with exact matches
of a full-length cDNA sequence to the
genome) were identified, and the region
corresponding to the cDNA was annotated as a
predicted transcript. A subset of the curated
human gene set RefSeq from the National
Center for Biotechnology Information
(NCBI) was included as a data set searched in
the computational pipeline. If a RefSeq
transcript matched the genome assembly for at least
50% of its length at >92% identity, then the
SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation.
Because the genome sequence has gaps and
sequence errors such as frameshifts, it was not
always possible to predict a transcript that
agrees precisely with the experimentally
determined cDNA sequence. A total of 6538 genes
in our inventory were identified and transcripts
predicted in this way.
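
The promotion rule for RefSeq alignments reduces to two thresholds. A minimal Python sketch follows, assuming the aligned length and the identity have already been computed from the alignment; the function and parameter names are hypothetical.

```python
def passes_refseq_threshold(aligned_bases, transcript_length, identity,
                            min_fraction=0.50, min_identity=0.92):
    """True if an alignment covers at least min_fraction of the transcript
    at an identity strictly greater than min_identity."""
    fraction = aligned_bases / transcript_length
    return fraction >= min_fraction and identity > min_identity

# Made-up alignments of a 1,000-base transcript
print(passes_refseq_threshold(600, 1000, 0.95))  # enough length and identity
print(passes_refseq_threshold(400, 1000, 0.99))  # too little of the transcript
print(passes_refseq_threshold(600, 1000, 0.92))  # identity not above 92%
```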

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8,
and (C) a 1-Mb region of chromosome 8 representing
a single Celera scaffold. To generate the figure, Celera
fragment sequences were mapped onto each
assembly. The PFP assembly is indicated in the upper third
of each panel; the Celera assembly is indicated in the
lower third. In the center of the panel, green lines
show Celera sequences that are in the same order and
orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines
indicate sequence blocks that are in the same
orientation, but out of order. Red lines indicate sequence
blocks that are not in the same orientation. For
clarity, in the latter two cases, lines are only drawn
between segments of matching sequence that are at
least 50 kbp long. The top and bottom thirds of each
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as
expected from the mean library insert size, are
omitted from the figure for clarity.) Predicted breakpoints,
corresponding to stacks of violated mate pairs of the
same type, are shown as blue ticks on each assembly
axis. Runs of more than 10,000 Ns are shown as cyan
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
assembly. Blue tick marks represent breakpoints, whereas red tick marks
represent a gap of larger than 10,000 bp. The number of breakpoints per
chromosome is indicated in black, and the chromosome numbers in red.

Regions that have a substantial amount of
sequence similarity, but do not match known
genes, were analyzed by that part of the Otto
system that uses the sequence similarity
information to predict a transcript. Here, Otto
evaluates evidence generated by the
computational pipeline, corresponding to
conservation between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins to predict potential genes in the
human genome. The sequence from the region
of genomic DNA contained in a gene bin was
extracted, and the subsequences supported by
any homology evidence were marked (plus 100
bases flanking these regions). The other bases
in the region, those not covered by any
homology evidence, were replaced by N's. This
sequence segment, with high-confidence regions
represented by the consensus genomic
sequence and the remainder represented by N's,
was then evaluated by Genscan to see if a
consistent gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding
algorithms), and by eliminating regions with no
supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite
different from the prediction that Genscan
returned on the same region of native genomic
sequence. A weakness of using Genscan to
refine the gene model is the loss of valid, small
exons from the final annotation.
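
The masking step described above (keep bases supported by homology evidence plus a flanking margin, replace everything else with N) can be sketched as follows. This is an illustrative Python version under assumptions: evidence is given as 0-based half-open intervals, and `mask_unsupported` is a hypothetical name. The example uses a flank of 2 rather than 100 only to keep the sequence short.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace every base not within `flank` bases of an evidence
    interval with 'N'. Intervals are 0-based and half-open."""
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))

seq = "ACGT" * 5                       # 20-base toy "gene bin"
masked = mask_unsupported(seq, [(8, 12)], flank=2)
print(masked)
```

The masked string keeps its original length and coordinates, so a gene model predicted on it can be mapped straight back to the scaffold.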

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were
calculated by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).

Method                Sensitivity  Specificity
Otto (RefSeq only)*   0.939        0.973
Otto (homology)†      0.604        0.884
Genscan               0.501        0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an
evidence-based Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.

The next step in defining gene structures
based on sequence similarity was to compare
each predicted transcript with the
homology-based evidence that was used in previous steps
to evaluate the depth of evidence for each exon
in the prediction. Internal exons were
considered to be supported if they were covered by
homology evidence to within ±10 bases of
their edges. For first and last exons, the internal
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that
the total number of "hits," as defined above,
divided by the number of exons in the
prediction must be >0.66 or must correspond to a
RefSeq sequence. A single-exon gene must be
covered by at least three supporting hits (±10
bases on each side), and these must cover the
complete predicted open reading frame. For
a single-exon gene, we also required that
the Genscan prediction include both a start
and a stop codon. Gene models that did not
meet these criteria were disregarded, and
those that passed were promoted to Otto
predictions. Homology-based Otto
predictions do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH
(63)] were run as part of the computational
analysis, the results of these programs were not
directly used in making the Otto predictions.
Otto predicted 11,226 additional genes by
means of sequence similarity.
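
The retention criteria above amount to a short decision rule. The following Python sketch is one reading of that rule, with hypothetical function names; in particular, treating "covered to within ±10 bases of their edges" as both hit edges lying within 10 bases of the exon edges is an interpretation, and the special handling of first and last exons is omitted.

```python
def exon_supported(exon, hit, tolerance=10):
    """True if a homology hit matches an internal exon to within
    ±tolerance bases at both edges. exon and hit are (start, end)."""
    (exon_start, exon_end), (hit_start, hit_end) = exon, hit
    return (abs(hit_start - exon_start) <= tolerance
            and abs(hit_end - exon_end) <= tolerance)

def retain_prediction(n_exons, n_hits, is_refseq=False,
                      covers_orf=False, has_start=False, has_stop=False):
    """Apply the multi-exon or single-exon retention rule."""
    if n_exons > 1:
        # Multi-exon: hits per exon must exceed 0.66, or a RefSeq match
        return is_refseq or n_hits / n_exons > 0.66
    # Single-exon: at least three hits covering the whole ORF,
    # and a prediction with both a start and a stop codon
    return n_hits >= 3 and covers_orf and has_start and has_stop

print(exon_supported((100, 200), (95, 208)))
print(retain_prediction(n_exons=6, n_hits=5))
print(retain_prediction(n_exons=6, n_hits=3))
print(retain_prediction(n_exons=1, n_hits=3, covers_orf=True,
                        has_start=True, has_stop=True))
```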

3.2 Otto validation 

To validate the Otto homology-based process 
and the method that Otto uses to define the 
structures of known genes, we compared tran- 
scripts predicted by Otto with their correspond- 
ing (and presumably correct) transcript from a 
set of 4512 RefSeq transcripts for which there 
was a unique SIM4 alignment (Table 7). In 
order to evaluate the relative performance of 
Otto and Genscan, we made three comparisons. 
The first involved a determination of the accu- 
racy of gene models predicted by Otto with 
only homology data other than the correspond- 
ing RefSeq sequence (Otto homology in Table 
7). We measured the sensitivity (correctly pre- 
dicted bases divided by the total length of the 
cDNA) and specificity (correctly predicted 
bases divided by the sum of the correctly and 
incorrectly predicted bases). Second, we exam- 
ined the sensitivity and specificity of the Otto 
predictions that were made solely with the
RefSeq sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq).
And third, we determined the accuracy of the
Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment
method (Otto-RefSeq) was the most accurate,
and Otto-homology performed better than
Genscan by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the
Otto-RefSeq annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not
contained in the original RefSeq transcripts. The
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 
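
As a worked illustration of the two measures, the sketch below computes sensitivity and specificity from base counts, following the definitions given in the text and in the Table 7 caption. The function name and the example numbers are made up for illustration.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Sensitivity = correctly predicted bases / RefSeq transcript length.
    Specificity = correctly predicted bases / predicted transcript length."""
    return n_aligned / refseq_length, n_aligned / prediction_length

# Hypothetical prediction: 1,200 bases long, 900 of which align
# uniquely to a 1,000-base RefSeq transcript
sens, spec = sensitivity_specificity(900, 1000, 1200)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")

# The complements give the error rates quoted for Otto-RefSeq
missed = 1 - 0.939       # fraction of RefSeq bases not in the annotation
extra = 1 - 0.973        # fraction of annotated bases not in RefSeq
print(f"missed={missed:.1%} extra={extra:.1%}")
```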

Because Otto uses an evidence-based
approach to reconstruct genes, the absence of
experimental evidence for intervening exons
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be
combined into a single transcript. We also examined
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different
gene-prediction strategy in regions where the
homology evidence was less strong. Here the
results of de novo gene predictions were
used. For these genes, we insisted that a
predicted transcript have at least two of the
following types of evidence to be included
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome
fragment matches. This final class of
predicted genes is a subset of the predictions
made by the three gene-finding programs
that were used in the computational
pipeline. For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs
resulted in about 155,695 predictions, of
which ~76,410 were nonredundant
(nonoverlapping with one another). Of these,
57,935 did not overlap known genes or 
predictions made by Otto. Only 21,350 of 
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the
requirement for other supporting evidence is
made more stringent, this number drops
rapidly so that demanding two types of evidence
reduces the total gene number to 26,383 and
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode 
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 
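
The gene-number estimates above are simple sums of the Otto annotations and the de novo predictions at each evidence level. A minimal Python sketch of the arithmetic, using the counts given in the text and in the de novo transcript row of Table 8 (variable names are illustrative):

```python
# Otto annotations: known-gene matches plus homology-based predictions
otto_known = 6_538
otto_homology = 11_226
otto_total = otto_known + otto_homology

# De novo predictions not overlapping Otto, keyed by the minimum
# number of supporting evidence types required
de_novo = {1: 21_350, 2: 8_619, 3: 4_947}

totals = {n_types: otto_total + count for n_types, count in de_novo.items()}
for n_types, total in totals.items():
    print(f">= {n_types} evidence type(s): {total:,} genes")
```

With three evidence types required the sum is 22,711, which the text rounds to about 23,000.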

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the
following evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs) or similarity to a known protein
reduced this number to 1010. Adding this to
the numbers from the previous paragraph
would give us estimates of about 40,000,
27,000, and 24,000 potential genes in the
human genome, depending on the stringency
of evidence considered. Table 8 illustrates the
number of genes and presents the degree of
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set 
includes the 6538 genes predicted by Otto on 
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on
homology evidence, and 8619 from the subset of
transcripts from de novo gene-prediction
programs that have two types of supporting
evidence. The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are
subject to all the limitations of an automated 
process. Considerable refinement is still nec- 
essary to improve the accuracy of these tran- 
script predictions. All the predictions and 
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 

3.4 Features of human gene transcripts

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that
support the Otto and other predicted transcripts.
For example, one can see that a typical Otto
transcript has 6.99 of its 7.81 exons supported
by protein homology evidence. As would be
expected, the Otto transcripts generally have
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of
the noncoding attributes of the assembled
genome sequence and their correlations with
the predicted gene set. These include an
analysis of G+C content and gene density in the
context of cytogenetic maps of the genome,
an enumerative analysis of CpG islands, and
a brief description of the genome-wide
repetitive elements.

4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the
most visible, element of the structure of
the genome is the banding pattern produced
by Giemsa stain. Chromosomal banding
studies have revealed that about 17% to
20% of the human chromosome complement
consists of C-bands, or constitutive
heterochromatin (64). Much of this
heterochromatin is highly polymorphic and
consists of different families of alpha satellite
DNAs with various higher-order repeat
structures (65). Many chromosomes have 
complex inter- and intrachromosomal du- 
plications present in pericentromeric re- 
gions (66). About 5% of the sequence reads 
were identified as alpha satellite sequences; 
these were not included in the assembly. 



[Figure 8: bar chart. Legend: Otto (homology), Otto (RefSeq only), Genscan. Horizontal axis: Number of predictions per RefSeq transcript.]

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate 
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, totat set of accepted de novo predictions). • 







Total 




Types of evidence 






No. of lines of evidence* 








Mouse 


Rodent 


Protein 


Human 


SI 


S2 


5:3 


S4 


Otto 


Number of 


17,969 


17,065 


14,881 


15,477 


16,374 


17,968f 


17,501 


15,877 


12,451 




transcripts 
Number of 


141,218 


111,174 


89.569 


108,431 


118,869 


140,710 


127,955 


99,574 


59,804 


De novo 


exons 
Number of 


58,032 


14,463 


5,094 


8,043 


9.220 


ZhSSO 


8,619 


4,947 


1,904 




transcripts . 
Number of 


319,935 


48,594 


19,344 


26.264 


■ 40,104 


79,148 


31.130 


17,508 


6,520 


No. of exons per 
transcript 


exons 
Otto 
De novo 


7.84 
5.53 


5.77 
3.17 


6.01 
3.80 


6.99 
3.27 


7.24 
4.36 


7.81 
3.7 


7.19 
3.56 


6.00 
3.42 


4.28 
3.16 



•Four kinds of evidence (conservation In 3X mouse genomic DNA. similarity to human EST or cONA. similarity to rodent EST or cDNA, and similarity to known proteins) were 
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript tThis 
number includes alternative splice forms of the 17,764 genes mentioned elsewhere In the text 
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Examination of pericentromeric regions is 
ongoing. 

The remaining ~80% of the genome, the 
euchromatic component, is divisible into G-, 
R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide 
composition and gene density, although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are 
the most G+C- and gene-rich, and G-bands are 
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy) 
isochores fall into three G+C-rich classes rep- 
resenting 24, 8, and 5% of the genome. Gene 
concentration has been claimed to be very low 
in the L isochores and 20-fold more enriched in 
the H2 and H3 isochores (70). By examining 
contiguous 50-kbp windows of G+C content 
across the assembly, we found that regions of 
G+C content >48% (H3 isochores) averaged 
273.9 kbp in length, those with G+C content 
between 43 and 48% (H1+H2 isochores) aver-
aged 202.8 kbp in length, and the average span
of regions with <43% (L isochores) was 
1078.6 kbp. The correlation between G+C 
content and gene density was also examined in
50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that 
the density of genes was greater in regions of 
high G+C than in regions of low G+C content, 
as expected. However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 
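
The windowed G+C analysis described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the class boundaries (<43%, 43 to 48%, >48%) are taken from the text, and windows made up entirely of undetermined bases are assumed to score 0% G+C.

```python
from itertools import groupby

def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0

def isochore_class(gc):
    """Map a G+C percentage to the isochore classes used in the text."""
    if gc > 48.0:
        return "H3"
    if gc >= 43.0:
        return "H1/H2"
    return "L"

def isochore_regions(seq, window=50_000):
    """Classify contiguous windows, then merge neighbours of the same class.

    Returns a list of (class, start, end) tuples with end exclusive.
    """
    labels = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        labels.append((isochore_class(gc_percent(chunk)), start, start + len(chunk)))
    regions = []
    for cls, group in groupby(labels, key=lambda t: t[0]):
        group = list(group)
        regions.append((cls, group[0][1], group[-1][2]))
    return regions
```

Averaging `end - start` over the regions of one class gives a figure comparable in kind to the mean region lengths quoted above.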

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density (X, 4,
18, 13, and Y) also have the fewest H3 bands.
Chromosome 15, which also has few H3
bands, did not have a particularly low gene
density in our analysis. In addition, chromo-
some 8, which we found to have a low gene
density, does not appear to be unusual in its
H3 banding.

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It ap-
pears that the human genome does indeed con-
tain deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromo- 
somes. Gene-rich chromosomes 17, 19, and 22 
have only about 12% of their collective 171 
Mbp in deserts, whereas gene-poor chromo- 
somes 4, 13, 18, and X have 27.5% of their 492 
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not nec- 
essarily imply that they are devoid of biological 
function. 
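
The desert definition above (a region >500 kbp without a gene) can be sketched as a scan over sorted gene intervals. The function names and the half-open coordinate convention are assumptions made for this illustration.

```python
def find_deserts(chrom_length, genes, min_size=500_000):
    """Return gene-free gaps longer than min_size.

    genes: iterable of (start, end) intervals, end exclusive (assumed convention).
    """
    deserts = []
    cursor = 0  # rightmost position covered by any gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts

def desert_fraction(chrom_length, genes, min_size=500_000):
    """Fraction of the chromosome that lies in deserts."""
    total = sum(end - start for start, end in find_deserts(chrom_length, genes, min_size))
    return total / chrom_length
```

As a rough check on the quoted figure, 605 Mbp out of a 2.91-Gbp genome (size including gaps, Table 11) is about 21%, in line with "about 20%"; which denominator the authors used is not stated here.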

4.2 Linkage map 

Linkage maps provide the basis for genetic
analysis and are widely used in the study of the
inheritance of traits and in the positional clon-
ing of genes. The distance metric, centimorgans
(cM), is based on the recombination rate be-
tween homologous chromosomes during meio-
sis. In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities en-
abled by a nearly complete genome sequence is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

Table 9. Characteristics of G+C in isochores.

                        Fraction of genome       Fraction of genes
Isochore   G+C (%)    Predicted*   Observed    Predicted*   Observed

H3         >48             5          9.5          37         24.8
H1/H2      43-48          25         21.2          32         26.6
L          <43            67         69.2          31         48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.

Fig. 9. Comparison of the number of exons per transcript between
the 17,968 Otto transcripts and 21,350 de novo transcript predictions
with at least one line of evidence that do not overlap with an
Otto prediction. Both sets have the highest number of transcripts
in the two-exon category, but the de novo gene predictions are
skewed much more toward smaller transcripts. In the Otto set,
19.7% of the transcripts have one or two exons, and 5.7% [...]

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, ex-
pressed as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. High-
er rates of recombination in the telomeric
region of the chromosomes have been previ-
ously documented (73). From this mapping
result, there is a difference of 4.99 between
lowest rates and highest rates and the largest
difference of 4.4 between males and females
(4.99 to 0.47 on chromosome 16). This indi-
cates that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates be-
tween males and females. The human ge-
nome has recombination hotspots, where re-
combination rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination
rate will depend on the size of the window
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphism Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor of variation in
recombination rates between any pair of
markers would be extremely useful in design-
ing markers to narrow a region of linkage,
such as in positional cloning projects.
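
The per-window rate calculation described above can be sketched as follows. This is an illustrative simplification: the function name is hypothetical, and the rate within a window is assumed to be the genetic distance between the first and last marker in that window divided by their physical separation. The text does not specify how markers near window edges were handled.

```python
def recombination_rates(markers, window=3_000_000):
    """Estimate cM per Mb in fixed physical windows along one chromosome.

    markers: list of (physical_position_bp, genetic_position_cM).
    Returns {window_index: rate} for windows holding at least two markers
    at distinct physical positions.
    """
    bins = {}
    for bp, cm in sorted(markers):
        bins.setdefault(bp // window, []).append((bp, cm))
    rates = {}
    for idx, pts in bins.items():
        if len(pts) < 2:
            continue
        (bp0, cm0), (bp1, cm1) = pts[0], pts[-1]
        if bp1 == bp0:
            continue
        rates[idx] = (cm1 - cm0) / ((bp1 - bp0) / 1e6)
    return rates
```

Running the same function separately on male, female, and sex-averaged genetic positions would give the three column groups of Table 12.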

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79).

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et al.
(76) and Gardiner-Garden and Frommer
(75) used a computational method to iden-
tify CpG islands and defined them as re-
gions of DNA of >200 bp that have a G+C
content of >50% and a ratio of observed
versus expected frequency of CG dinucle-
otide ≥0.6.

It is difficult to make a direct compari-
son of experimental definitions of CpG is-
lands with computational definitions be-
cause computational methods do not con-
sider the methylation state of cytosine and
experimental methods do not directly select
regions of high G+C content. However, we
can determine the correlation of CpG islands
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly
available annotation of chromosome 22, as
well as using the entire human genome in
our assembly and the computationally an-
notated genes. A variation of the CpG is-
land computation was compared with
Larsen et al. (76). The main differences are
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
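
The sliding-window procedure just described can be sketched as below. This is a hedged reconstruction from the prose, not the authors' code: the function names and the 1-bp step are assumptions, and the observed/expected ratio uses the common form (number of CG × length) / (number of C × number of G).

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG dinucleotide ratio)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    ratio = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, ratio

def passes(seq, min_gc=0.5, min_ratio=0.6):
    """True if seq has G+C content > min_gc and CG ratio >= min_ratio."""
    gc_fraction, ratio = cpg_stats(seq)
    return gc_fraction > min_gc and ratio >= min_ratio

def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Slide a window, merge qualifying windows only if they overlap,
    then re-score each merged candidate and drop any below threshold."""
    merged = []
    for start in range(0, len(seq) - window + 1, step):
        if passes(seq[start:start + window], min_gc, min_ratio):
            end = start + window
            if merged and start < merged[-1][1]:
                merged[-1][1] = end          # overlaps the previous candidate
            else:
                merged.append([start, end])  # starts a new candidate
    return [(s, e) for s, e in merged if passes(seq[s:e], min_gc, min_ratio)]
```

Calling `find_cpg_islands(seq, min_ratio=0.8)` corresponds to the stricter threshold discussed next as method 2.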

To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
otide likelihood ratio. Besides using the orig- 
inal threshold of 0.6 (method 1), we used a 
higher threshold of CG dinucleotide likeli- 
hood ratio of 0.8 (method 2), which results in 
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only 2.6% of the 
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a 



■ % of genome 
□ % of genes 




30-35% 35-40% 40-45% 45-50% 50-55% 55-60% 60-65% 



Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the 
nome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of 
nes associated with each G+C bin is represented by the yellow bars. The graph shows that about 
% of the genome has a C+C content of between 50 and 55%, but that this portion contains 
nearly 15% of the genes. 



CpG island. This is comparable to ratios re- 
ported by others (82). The last two rows of 
the table show the observed and expected 
average distance, respectively, of the closest 
CpG island from the first exon. The observed 
average closest CpG islands are smaller than 
the corresponding expected distances, con- 
firming an association between CpG island 
and the first exon. .. . 

We also looked at the distribution of CpG
island nucleotides among various sequence
classes such as intergenic regions, introns,
exons, and first exons. We computed the
likelihood score for each sequence class as
the ratio of the observed fraction of CpG
island nucleotides in that sequence class
to the expected fraction of CpG island
nucleotides in that sequence class. Apply-
ing method 1 to the CSA gave
scores of 0.89 for intergenic region, 1.2 for
intron, 5.86 for exon, and 13.2 for first
exon. The same trend was also found for 
chromosome 22 and after the application of 
a higher threshold (method 2) on both data 
sets. In sum, genome-wide analysis has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
first coding exons. 
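
The likelihood score for a sequence class can be illustrated with a short calculation. The sketch assumes that the "expected fraction" is simply the share of all nucleotides that belongs to the class, which the text does not state explicitly; the function and argument names are hypothetical.

```python
def cpg_enrichment(class_lengths, island_bases):
    """Observed/expected fraction of CpG island nucleotides per sequence class.

    class_lengths: {class_name: total nucleotides in that class}
    island_bases:  {class_name: CpG island nucleotides falling in that class}
    """
    total_length = sum(class_lengths.values())
    total_island = sum(island_bases.values())
    scores = {}
    for name, length in class_lengths.items():
        observed = island_bases.get(name, 0) / total_island
        expected = length / total_length  # assumed definition of "expected"
        scores[name] = observed / expected
    return scores
```

A score above 1 means the class holds more CpG island sequence than its size alone would predict, which is the pattern reported for exons and first exons.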

4.4 Genome-wide repetitive elements

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes.

Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp win-
dows. The percent of G+C nucleotides was calculated in 100 [...]
windows. The number of ESTs and Alu elements is shown per 100 [...]
window.

5.1 Retrotransposition in the human
genome 

Retrotransposition of processed mRNA 
transcripts into the genome results in func- 
tional genes, called intronless paralogs, or 
inactivated genes (pseudogenes). A paralog
refers to a gene that appears in more than
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of
genes encoding functionally similar or
identical proteins has been previously de-
scribed (84, 85). Cataloging these evolu-
tionary events on the genomic landscape is
of value in understanding the functional
consequences of such gene-duplication
events in cellular biology. Identification of
conserved intronless paralogs in the mouse 
or other mammalian genomes should pro- 
vide the basis for capturing the evolution- 
ary chronology of these transposition 
events and provide insights into gene loss 
and accretion in the mammalian radiation. 
A set of proteins corresponding to all 901
Otto-predicted, single-exon genes were sub-
jected to BLAST analysis against the proteins
encoded by the remaining multiexon predict-
ed transcripts. Using homology criteria of
[...]0% sequence identity over 90% of the
length, we identified 298 instances of single-
to multi-exon correspondence. Of these 298
sequences, 97 were represented in the Gen-
Bank data set of experimentally validated
full-length genes at the stringency specified
and were verified by manual inspection.

We believe that these 97 cases may rep-
resent intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (6V, 86). We do not find a bias 
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chro-
mosomes. Interesting examples include the
retrotransposition of a five exon-containing
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam-
mals (87).

Our preliminary set of retrotransposed in-
tronless paralogs contains a clear overrepre-
sentation of genes involved in translational
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could ac-
count for differences in tissue-specific gene
expression. Defining which, if any, of these
processed genes are functionally expressed
and translated will require further elucidation
and experimental validation. 



5.2 Pseudogenes

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not ex-
pressed. We developed a method for the pre-
liminary analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces
that account for gene inactivation. The gen-
eral structural characteristics of these pro-
cessed pseudogenes include the complete
lack of intervening sequences found in the
functional counterparts, a poly(A) tract at the
3' end, and direct repeats flanking the pseu-
dogene sequence. Processed pseudogenes oc-
cur as a result of retrotransposition, whereas
unprocessed pseudogenes arise from segmen-
tal genome duplication.

Table 11. Genome overview.

Size of the genome (including gaps)                                   2.91 Gbp
Size of the genome (excluding gaps)                                   2.66 Gbp
Longest contig                                                        1.99 Mbp
Longest scaffold                                                      14.4 Mbp
Percent of A+T in the genome                                          54
Percent of G+C in the genome                                          38
Percent of undetermined bases in the genome                           9
Most GC-rich 50 kb                                                    Chr. 2 (66%)
Least GC-rich 50 kb                                                   Chr. X (25%)
Percent of genome classified as repeats                               35
Number of annotated genes                                             26,383
Percent of annotated genes with unknown function                      42
Number of genes (hypothetical and annotated)                          39,114
Percent of hypothetical and annotated genes with unknown function     59
Gene with the most exons                                              Titin (234 exons)
Average gene size                                                     27 kbp
Most gene-rich chromosome                                             Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                           Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)          605 Mbp
Percent of base pairs spanned by genes                                25.5 to 37.8*
Percent of base pairs spanned by exons                                1.1 to 1.4*
Percent of base pairs spanned by introns                              24.4 to 36.4*
Percent of base pairs in intergenic DNA                               74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons          Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons           Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)    Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                 1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

                 Male               Sex-average             Female
Chrom.    Max.   Avg.   Min.    Max.   Avg.   Min.    Max.   Avg.   Min.

1         2.60   1.12   0.23    2.81   1.42   0.52    3.39   1.76   0.68
2         2.23   0.78   0.33    2.65   1.12   0.54    3.17   1.40   0.61
3         2.55   0.86   0.23    2.40   1.07   0.42    2.71   1.30   0.33
4         1.66   0.67   0.15    2.06   1.04   0.60    2.50   1.40   0.77
5         2.00   0.67   0.18    1.87   1.08   0.42    2.26   1.43   0.62
6         1.97   0.71   0.28    2.57   1.12   0.37    3.47   1.67   0.64
7         2.34   1.16   0.48    1.67   1.17   0.47    2.27   1.21   0.34
8         1.83   0.73   0.14    2.40   1.05   0.46    3.44   1.36   0.43
9         2.01   0.99   0.53    1.95   1.32   0.77    2.63   1.66   0.82
10        3.73   1.03   0.22    3.05   1.29   0.66    2.84   1.51   0.76
11        1.43   0.72   0.31    2.13   0.99   0.47    3.10   1.32   0.49
12        4.12   0.76   0.26    3.35   1.16   0.49    2.93   1.55   0.59
13        1.60   0.75   0.01    1.87   0.95   0.17    2.49   1.19   0.32
14        3.15   0.98   0.18    2.65   1.30   0.62    3.14   1.63   0.75
15        2.28   0.94   0.34    2.31   1.22   0.42    2.53   1.56   0.54

The remaining rows (chromosome 16 onward) are incomplete in this copy. Surviving values, in source order:
1.83, 1.00, 0.47, 0.63, 4.99, 0.87, 1.41, 0.49, 21, 1.62, 1.88, 1.08, NA (12 entries), 4.12, 0.00, 3.75, 1.22, 0.17, 4.99, 1.55, 0.32

We searched the complete set of Otto- 
predicted transcripts against the genomic se-
quence by means of BLAST. Genomic re-
gions corresponding to all Otto-predicted
transcripts were excluded from this analysis. 
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
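
The selection criterion above (greater than 70% identity over at least 70% of the transcript length) amounts to a simple filter over alignment hits. The `Hit` record below is a hypothetical structure invented for this sketch. It is not a BLAST output format, and the exclusion of the transcripts' own genomic regions is assumed to have happened upstream.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical summary of one transcript-to-genome alignment."""
    transcript_id: str
    transcript_length: int   # bp
    aligned_length: int      # bp of the transcript covered by the match
    percent_identity: float  # 0 to 100

def candidate_pseudogenes(hits, min_identity=70.0, min_coverage=0.70):
    """Keep hits with identity > min_identity over >= min_coverage of the transcript."""
    return [
        h for h in hits
        if h.percent_identity > min_identity
        and h.aligned_length / h.transcript_length >= min_coverage
    ]
```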

We looked for correlations between 
structural elements and the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were com-
pared between the genes with processed
pseudogenes (1177 source genes) versus
the remainder of the predicted gene set.
Transcripts that give rise to processed pseu-
dogenes have shorter average transcript
length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no
pseudogene was detected. The overall GC
content did not show any significant differ-
ence, contrary to a recent report (88). There
is a clear trend in gene families that are 
present as processed pseudogenes. These 
include ribosomal proteins (67%), lamin 
receptors (10%), translation elongation fac- 
tor alpha (5%), and HMG-non-histone pro- 
teins (2%). The increased occurrence of 
retrotransposition (both intronless paralogs 
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                 Chromosome 22           Whole genome
                                                                         (CS assembly)
                                              Method 1   Method 2     Method 1   Method 2

Number of CpG islands detected                   5,211        522      195,706     26,876
Average length of island (bp)                      390        535          395        497
Percent of sequence predicted as CpG               5.9        0.8          2.6        0.4
Percent of first exons that overlap a               44         25           42         22
  CpG island
Percent of first exons with first position          37         22           40         21
  of exon contained inside a CpG island
Average distance between first exon and          1,013     10,486        2,182     17,021
  closest CpG island (bp)
Expected distance between first exon and         3,262     32,567        7,164     55,811
  closest CpG island (bp)

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

                                               Megabases in    Percent     Previously
Repetitive elements                            assembled       of          predicted
                                               sequences       assembly    (%) (83)

Alu                                                 288            9.9        10.0
Mammalian interspersed repeat (MIR)                  66            2.3         1.7
Medium reiteration (MER)                             50            1.7         1.6
Long terminal repeat (LTR)                          155            5.3         5.6
Long interspersed nucleotide element (LINE)         466           16.1        16.7
Total                                              1025           35.3        35.6



The complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de-
pending on the quality of the annotation of
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de-
termined to be in the same family and the
same complete Lek cluster (essentially
paralogous genes) (89). Initially, each chro-
mosome was represented as a string of genes
ordered by the start codons for predicted
genes along the chromosome. We considered
the two strands as a single string, because
local inversions are relatively common events
relative to large-scale duplications. Each
gene was indexed according to the protein
family and Lek complete cluster (89). All
pairs of indexed gene strings were then
aligned in both the forward and reverse di-
rections with the Smith-Waterman algorithm
(90). A match between two proteins of the
same Lek complete cluster was given a score
of 10 and a mismatch -10, with gap open
and extend penalties of -4 and -1. With
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 
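
The gene-string alignment described above can be sketched with a standard Smith-Waterman recurrence using affine gaps (Gotoh's formulation). The scoring values come from the text; everything else is an assumption for illustration: the function names are hypothetical, each gene is represented by its cluster identifier, and a gap of length k is taken to cost gap_open + (k - 1) × gap_extend, a convention the text does not spell out.

```python
def smith_waterman(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best local alignment score between two sequences of cluster IDs."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

def best_block_score(a, b, **scoring):
    """Align in both the forward and the reverse direction and keep the better score."""
    return max(smith_waterman(a, b, **scoring), smith_waterman(a, b[::-1], **scoring))
```

In this sketch, genes that belong to no cluster would need distinct placeholder identifiers so that they never match one another.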

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes
of 100 Mbp can be aligned in less than 20
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concat-
enated end-to-end in order as they occur
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the 



filtering methods, a shuffled protein set was 
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on
the shuffled data being used to estimate the
false-positive rate. The algorithm after filter- 
ing yielded 10,310 gene pairs in 1077 dupli- 
cated blocks containing 3522 distinct genes; 
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block duplications
is ancient segmental duplications. In
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
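The shuffled-genome control and the resulting false-positive estimate can be sketched as follows. The function names and the toy genome are hypothetical; only the counts (10,310 and 370 gene pairs, and the block-size breakdown) come from the text.

```python
import random


def shuffle_into_chromosomes(chromosomes, seed=0):
    """Randomize protein order genome-wide, then re-partition the pool into
    chromosomes holding the same number of proteins as the input."""
    pool = [protein for chrom in chromosomes for protein in chrom]
    random.Random(seed).shuffle(pool)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


# Toy genome with three "chromosomes" (the real set has 24).
toy_genome = [["p1", "p2", "p3"], ["p4", "p5"], ["p6"]]
toy_shuffled = shuffle_into_chromosomes(toy_genome)

# False-positive estimate: pairs found in shuffled data / pairs in real data.
pairs_real, pairs_shuffled = 10310, 370
false_positive_rate = pairs_shuffled / pairs_real
print(f"{false_positive_rate:.1%}")  # 3.6%

# The block-size breakdown adds up to the total number of blocks.
blocks_by_size = {"three genes": 159, "four genes": 137, "five or more": 781}
total_blocks = sum(blocks_by_size.values())
print(total_blocks)  # 1077
```

Shuffling the whole pool and cutting it at the original chromosome sizes keeps the composition identical, so any block detected in the shuffled set can only be a chance arrangement.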

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstructions
at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13).
The proteins are not contiguous but span a 
region containing 97 proteins on chromo- 
some 2 and 332 proteins on chromosome 14. 
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 × 10^-68 (93). This duplicated
set spans 20 Mbp on chromosome 2 and
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset): This duplication 
contains 64 detected ordered intrachromo- 
somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 



[Fig. 12 plot. Series: Human/Worm and Human/Fly. Horizontal axis: ratio of proteins
per cluster, from human predominant (5:1, 4:1, 3:1) to fly/worm predominant (1:3, 1:4, 1:5).]

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted.

By this measure, the duplication segment
spans nearly half of each chromosome's net 
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromo- 
some 20, for a density of involved proteins of 
20 to 30%. This is consistent with an ancient
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density observed
here. As an independent verification
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of 
the pairs of aligning proteins in this duplica- 
tion, including some of those annotated (Fig. 
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 
logs; their relative scarcity within the genome 
validates the uniqueness and robust nature of 
their alignments. 
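The quoted density of 20 to 30% follows directly from the counts given for the chromosome 18 to 20 alignment. A minimal arithmetic check, with variable names chosen here for illustration:

```python
# 64 aligned protein pairs among the protein assignments on each chromosome.
pairs = 64
assignments = {"chromosome 18": 217, "chromosome 20": 322}

density = {name: pairs / total for name, total in assignments.items()}
for name, value in density.items():
    print(f"{name}: {value:.1%}")
# chromosome 18: 29.5%
# chromosome 20: 19.9%
```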

Two additional qualitative features were observed
among many of the large-scale duplications.
First, several proteins with disease associations,
with OMIM (Online Mendelian Inheritance
in Man) assignments, are members of
duplicated segments (see web table 2 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This se- 
lective accretion of noncoding DNA (or con- 
versely, loss of noncoding DNA) on one of a 



pair of duplicated chromosome regions was 
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the
blocks detected by this genome-wide analysis.
The regions of human chromosomes involved
in the large-scale duplications expanded upon
above (chromosomes 2 to 14, 2 to 12, and 18 to
20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to 
their human synteny partners than the human 
duplication regions are to each other. Further, 
the corresponding mouse chromosomal regions 
each bear a significant proportion of genes or- 
thologous to the human genes on which the 
human duplication assignments were made. On 
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolu- 
tion, appear to be products of the same large- 
scale duplications observed in humans. Al- 
though further detailed analysis must be carried 
out once a more complete genome is assembled 
for mouse, the underlying large duplications 
appear to predate the two species' divergence.
This dates the duplications, at the latest, before
divergence of the primate and rodent lineages.
This date can be further refined upon examina- 
tion of the synteny between human chromo- 
somes and those of chicken, pufferfish (Fugu 
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually reveal
the stagewise history of our genome, and
with it a history of the emergence of many of 
the key functions that distinguish us from other 
living things. 

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used
to identify single-nucleotide polymorphisms
(SNPs) by comparison of the Celera sequence
to other SNP resources. The SNP rate between
two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small
proportion of all SNPs (<1%) potentially
impact protein function based on the functional
analysis of SNPs that affect the predicted
coding regions. This results in an estimate
that only thousands, not millions, of
genetic variations may contribute to the structural
diversity of human proteins.

Having a complete genome sequence enables
researchers to achieve a dramatic acceleration
in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we
discover the genetic basis for variation in health
among human beings. Whole-genome shotgun
sequencing is a particularly effective method
for detecting sequence variation in tandem with
whole-genome assembly. In addition, we compared
the distribution and attributes of SNPs
ascertained by three other methods: (i) alignment
of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads
of genomic sequence (referred to as "Kwok";
1,120,195 SNPs) (97), and (iii) reduced representation
shotgun sequencing (referred to as
"TSC"; 632,640 SNPs) (98). These data were
consistent in showing an overall nucleotide diversity
of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an
overwhelming preponderance of noncoding
variation that produces no change in expressed
proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details necessitated
a more ad hoc approach (quality scores
could not readily be obtained for the PFP as- 
sembly). First, all sequence differences between 
the two consensus sequences were identified; 
these were then filtered to reduce the contribu- 
tion of sequencing errors and misassembly. As 
a measure of the effectiveness of the filtering 
step, we monitored the ratio of transition and 
transversion substitutions, because a 2:1 ratio 
has been well documented as typical in mammalian
evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of removing
variants where the quality score in the
Celera consensus was less than 30 and where 
the density of variants was greater than 5 in 400 
bp. These filters resulted in shifting the transition-to-transversion
ratio from 1.57:1 to
1.89:1. When applied to 2.3 Gbp of alignments
between the Celera and PFP consensus se- 
quences, these filters resulted in identification 
of 2,104,820 putative SNPs from a total of 
2,778,474 substitution differences. Overlaps 
between this set of SNPs and those found by 
other methods are described below. 
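The two filters and the transition-to-transversion check can be sketched as below. The tuple layout, the function names, and the exact way density is measured (counting all candidate differences in a 400-bp window centered on each variant) are assumptions made for illustration; the text states only the thresholds.

```python
from bisect import bisect_left, bisect_right

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Keep variants with quality >= min_quality that do not sit in a
    window holding more than max_in_window candidate differences.

    variants: list of (position, ref, alt, quality) tuples.
    """
    positions = sorted(v[0] for v in variants)
    kept = []
    for pos, ref, alt, qual in variants:
        lo = bisect_left(positions, pos - window // 2)
        hi = bisect_right(positions, pos + window // 2 - 1)
        crowded = (hi - lo) > max_in_window
        if qual >= min_quality and not crowded:
            kept.append((pos, ref, alt, qual))
    return kept


def ts_tv_ratio(variants):
    ts = sum(is_transition(ref, alt) for _, ref, alt, _ in variants)
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")


# Toy data: three clean variants, one low-quality call, one dense cluster.
toy = [(100, "A", "G", 40), (1000, "C", "T", 40), (2000, "A", "C", 40),
       (3000, "G", "A", 20)]
toy += [(5000 + 10 * i, "A", "T", 40) for i in range(6)]
kept = filter_variants(toy)
print([v[0] for v in kept], ts_tv_ratio(kept))

# Fraction of substitution differences retained as putative SNPs in the text.
retained = 2_104_820 / 2_778_474
print(f"{retained:.1%}")
```

Random sequencing errors are expected to pull the ratio below 2:1, which is why a rise in the ratio after filtering serves as evidence that the filters remove errors preferentially.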



6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from 
dbSNP (www.ncbi.nlm.nih.gov/SNP) and 
13,150 from HGMD (Human Gene Muta- 
tion Database, from the University of 
Wales, UK), were mapped on the Celera con- 
sensus sequence by a sequence similarity 
search with the program PowerBlast (103). The 
two largest data sets in dbSNP are the Kwok 
and TSC sets, with 47% and 25% of the dbSNP 
records. Low-quality alignments with partial 
coverage of the dbSNP sequence and align- 
ments that had less than 98% sequence identity 
between the Celera sequence and the dbSNP 
flanking sequence were eliminated. dbSNP se- 
quences mapping to multiple locations on the 
Celera genome were discarded. A total of 
2,336,935 dbSNP variants were mapped to 
1,123,038 unique locations on the Celera sequence,
implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique locations.
The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of 
these methods was also found by another meth- 
od. The very high overlap (36.2%) between the 
Kwok and Celera-PFP SNPs may be due in part 
to the use by Kwok of sequences that went into 
the PFP assembly. The unusually low overlap 
(16.4%) between the Kwok and TSC sets is due 

Table 15. Overlap of SNPs from genome-wide
SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032.
Only unique SNPs in the TSC and Kwok data sets
were included.

to their being the smallest two sets. In addition,
24.5% of the Celera-PFP SNPs overlap with 
SNPs derived from the Celera genome se- 
quences (46). SNP validation in population 
samples is an expensive and laborious process, 
so confirmation on multiple data sets may pro- 
vide an efficient initial validation "in silico" (by 
computational analysis). 
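The overlap measure defined in the Table 15 legend (shared SNPs divided by the size of the smaller set) can be written as a short helper. Representing each SNP as a (chromosome, position) pair and the function name are assumptions for illustration.

```python
def overlap_fraction(snps_a, snps_b):
    """Return (shared count, shared count / size of the smaller set)."""
    a, b = set(snps_a), set(snps_b)
    smaller = min(len(a), len(b))
    shared = len(a & b)
    return shared, (shared / smaller if smaller else 0.0)


# Toy sets keyed by (chromosome, position).
set_one = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr3", 7)}
set_two = {("chr1", 100), ("chr2", 40), ("chr5", 900)}
shared, fraction = overlap_fraction(set_one, set_two)
print(shared, round(fraction, 3))  # 2 shared out of the 3 in the smaller set
```

Normalizing by the smaller set keeps the fraction between 0 and 1 regardless of how unequal the two databases are.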

One means of assessing whether the
three sets of SNPs provide the same picture
of human variation is to tally the frequencies
of the six possible base changes in
each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly
derived from small-scale analysis on candidate
genes (101), and our analysis with
all three data sets validates the previous
observations at the whole-genome scale.
There is remarkable homogeneity between
the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46)
in this substitution pattern. Compared with
the rest of the data sets, Celera-PFP devi- 
ates slightly from the 2:1 transition-to- 
transversion ratio observed in the other 
SNP sets. This result is not unexpected, 
because some fraction of the computation- 
ally identified SNPs in the Celera-PFP 
comparison may in fact be sequence errors. 
A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one 
assumed that 15% of the sequence differ- 
ences in the Celera-PFP set were a result of 
(presumably random) sequence errors. 

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucleotide
diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation sequencing,
we need to know the sequence
quality and the depth of coverage at each



site. These data are not readily available, so 
we could not estimate nucleotide diversity 
from the TSC effort. Estimation of nucleo- 
tide diversity from high-quality sequence 
overlaps should be possible, but again 
more information is needed on the details 
of all the alignments.
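As a minimal sketch of the statistic, nucleotide diversity can be computed as the average number of differences per site over all pairs of aligned sequences. The function name and toy alignment are hypothetical, and the sketch ignores the coverage and quality corrections that the text says a real estimate requires.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average per-site difference over all pairs of equal-length sequences."""
    n_sites = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(x != y for x, y in zip(s, t)) for s, t in pairs
    )
    return differences / (len(pairs) * n_sites)


toy_alignment = ["ACGT", "ACGA", "ACGT"]
toy_pi = nucleotide_diversity(toy_alignment)
print(round(toy_pi, 4))  # 2 differences over 3 pairs x 4 sites = 0.1667

# For two chromosomes, one SNP per 1200 to 1500 bp corresponds to:
pi_high, pi_low = 1 / 1200, 1 / 1500
print(f"{pi_low:.2e} to {pi_high:.2e}")
```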

Estimation of nucleotide diversity from a
shotgun assembly entails calculating, for each
column of the multialignment, the probability
that two or more distinct alleles are present
and the probability of detecting a SNP if in
fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The
greater the depth of coverage and the higher
the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even
after correcting for variation in coverage, the
nucleotide diversity appeared to vary across
autosomes. The significance of this heterogeneity
was tested by analysis of variance, with
estimates of π for 100-kbp windows to estimate
variability within chromosomes (for the
Celera-PFP comparison, F = 29.73, P <
0.0001).

Average diversity for the autosomes estimated
from the Celera-PFP comparison
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than au- 
tosomes, because for every four copies of 
autosomes in the population, there are only 
three X chromosomes, and this smaller ef- 
fective population size means that random 
drift will more rapidly remove variation 
from the X (106). 
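The three-to-four copy argument gives a simple expectation that can be compared with the reported values (8.94 × 10^-4 for autosomes, 6.54 × 10^-4 for the X). The sketch assumes diversity scales linearly with the number of chromosome copies, which is a simplification; variable names are illustrative.

```python
# Reported nucleotide diversity from the Celera-PFP comparison.
pi_autosomes = 8.94e-4
pi_x_observed = 6.54e-4

# Three X chromosomes for every four autosome copies in the population.
expected_ratio = 3 / 4
pi_x_expected = expected_ratio * pi_autosomes
observed_ratio = pi_x_observed / pi_autosomes

print(f"expected X diversity: {pi_x_expected:.3e}")
print(f"observed X/autosome ratio: {observed_ratio:.2f}")
```

Under this simplification the observed ratio falls close to, and slightly below, the expected three-quarters.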

Having ascertained nucleotide variation 
genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10
densely resequenced human genes was
8.00 × 10^-4 (108).

6.4 Variation in nucleotide diversity 
across the human genome 

Table 16. Summary of nucleotide changes in different SNP data sets.

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show
the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line
represents a pair of homologous genes belonging to a block; all blocks contain at least three
genes on each of the chromosomes where they appear. Each panel shows all the duplications
between a single chromosome and other chromosomes with shared blocks. The chromosome at the
center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed
from top to bottom within each panel ordered by chromosome number. The inset (bottom, center
right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display
the gene names of 12 of the 64 gene pairs shown.

Such an apparently high degree of variability
among chromosomes in SNP density
raises the question of whether there is heterogeneity
at a finer scale within chromosomes,
and whether this heterogeneity is
greater than expected by chance. If SNPs 
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral
coalescent (109). Applying well-tested algorithms
for simulating the neutral coalescent
with recombination (110), and using an ef- 
fective population size of 10,000 and a per- 
base recombination rate equal to the mutation 
rate (111), we generated a distribution of num- 
bers of SNPs by this model as well (112). The 
observed distribution of SNPs has a much larg- 
er variance than either the Poisson model or the 
coalescent model, and the difference is highly 
significant. This implies that there is significant
variability across the genome in SNP density, 
an observation that begs an explanation. 
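One simple way to see overdispersion relative to a Poisson model is the index of dispersion (variance divided by mean) of SNP counts per window, which is close to 1 for Poisson counts. The sketch below uses invented window counts and is not the test the authors ran; it does not simulate the coalescent.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance over mean; about 1 is expected under a Poisson model."""
    return variance(counts) / mean(counts)


# Hypothetical SNP counts in five 100-kbp windows (same mean, different spread).
even_windows = [80, 82, 78, 81, 79]
uneven_windows = [10, 150, 40, 200, 0]

print(dispersion_index(even_windows))    # well below 1: under-dispersed
print(dispersion_index(uneven_windows))  # far above 1: over-dispersed
```

A value far above 1 across many windows is the pattern the text describes as dispersion far greater than predicted from a Poisson distribution.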

Several attributes of the DNA sequence
may affect the local density of SNPs, in- 
cluding the rate at which DNA polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cy- 
tosines in CpG dinucleotides tend to under- 
go deamination to form thymine, account- 
ing for a nearly 10-fold increase in the 
mutation rate of CpGs over other dinucleotides.
We tallied the G+C content and nucleotide
diversities in 100-kbp windows
across the entire genome and found that the 
correlation between them was positive (r = 
0.21) and highly significant (P < 0.0001), 
but G+C content accounted for only a 
small part of the variation. 
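The statement that G+C content explains only a small part of the variation follows from squaring the correlation coefficient. The helper below is a plain Pearson correlation written for illustration; the window values are invented, and only r = 0.21 comes from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-window G+C fraction and nucleotide diversity.
gc_content = [0.36, 0.40, 0.44, 0.48, 0.52]
diversity = [7.9e-4, 9.5e-4, 7.2e-4, 9.9e-4, 8.4e-4]
print(round(pearson_r(gc_content, diversity), 2))

# Share of variance associated with G+C content at the reported r = 0.21.
variance_explained = 0.21 ** 2
print(f"{variance_explained:.1%}")  # 4.4%
```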

6.5 SNPs by genomic class 

To test homogeneity of SNP densities
across functional classes, we partitioned
sites into intergenic (defined as >5 kbp
from any predicted transcription unit), 5'-UTR,
exonic (missense and silent), intronic,
and 3'-UTR for 10,239 known
genes, derived from the NCBI RefSeq da- 
tabase and all human genes predicted from 
the Celera Otto annotation. In coding re- 
gions, SNPs were categorized as either si- 
lent, for those that do not change amino 
acid sequence, or missense, for those that 
change the protein product. The ratio of 
missense to silent coding SNPs in Celera- 
PFP, TSC, and Kwok sets (1.12, 0.91, and 
0.78, respectively) shows a markedly re- 
duced frequency of missense variants com- 
pared with the neutral expectation, consis- 
tent with the elimination by natural selec- 
tion of a fraction of the deleterious amino 
acid changes (112). These ratios are com- 
parable to the missense-to-silent ratios of 
0.88 and 1.17 found by Cargill et al (101) 
and by Halushka et al (102). Similar re- 
sults were observed in SNPs derived from 
Celera shotgun sequences (46). 
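The silent versus missense classification can be sketched with the standard genetic code. The function names and the toy SNP list are hypothetical, and reporting stop-codon changes as a separate "nonsense" class is an addition made here; the text distinguishes only silent and missense.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(codon, offset, alt_base):
    """Classify a single-base change at position offset (0-2) of a codon."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "silent"
    if "*" in (before, after):
        return "nonsense"  # stop gained or lost; kept apart in this sketch
    return "missense"


def missense_to_silent_ratio(snps):
    labels = [classify_coding_snp(*snp) for snp in snps]
    silent = labels.count("silent")
    return labels.count("missense") / silent if silent else float("inf")


toy_snps = [("GAA", 2, "G"),   # GAA -> GAG, Glu -> Glu
            ("GAA", 0, "A"),   # GAA -> AAA, Glu -> Lys
            ("CTT", 2, "C"),   # CTT -> CTC, Leu -> Leu
            ("GGC", 1, "A")]   # GGC -> GAC, Gly -> Asp
print(missense_to_silent_ratio(toy_snps))  # 2 missense / 2 silent = 1.0
```

Most random single-base changes in a codon alter the amino acid, so an observed ratio near 1 indicates that many missense changes have been removed.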

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 Ref- 
Seq genes, missense SNPs were only about 




[Fig. 14 plot. Horizontal axis: number of SNPs / 100 kb.]

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history.



0.12, 0.14, and 0.17% of the total SNP 
counts in Celera-PFP, TSC, and Kwok 
SNPs, respectively. Nonconservative pro- 
tein changes constitute an even smaller frac- 
tion of missense SNPs (47, 41, and 40% in 
Celera-PFP, Kwok, and TSC). Intergenic re- 
gions have been virtually unstudied (113), and 
we note that 75% of the SNPs we identified 
were intergenic (Table 17). The SNP rate was 
highest in introns and lowest in exons. The SNP 
rate was lower in intergenic regions than in 
introns, providing one of the first discriminators 
between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which
also exhibited a lower rate in exons than in
introns, and in extragenic regions than in introns
(46). Many of these intergenic SNPs will
provide valuable information in the form of
markers for linkage and association studies, and
some fraction is likely to have a regulatory
function as well.

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging 
prominent differences and similarities 
when the human genome is compared with 
other fully sequenced eukaryotic genomes.
Over 40% of the predicted protein set in 
humans cannot be ascribed a molecular 
function by methods that assign proteins to 
known families. A protein domain-based 
analysis provides a detailed catalog of the 
prominent differences in the human ge- 
nome when compared with the fly and 
worm genomes. Prominent among these are 
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemostasis,
acquired immune response, and cytoskeletal
complexity. The final enumeration
of protein families and details of protein
structure will rely on additional experimental
work and comprehensive manual
curation.

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Panther
Classification (CPC) (Fig. 15) (116).
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary
and are subject to several limitations.
Both the gene predictions and functional
assignments have been made by using computational
tools, although the statistical
models in Panther, Pfam, and SMART have
been built, annotated, and reviewed by expert
biologists. In the set of computationally
predicted genes, we expect both false-positive
predictions (some of these may in fact be inactive
pseudogenes) and false-negative predictions
(some human genes will not be computationally
predicted). We also expect errors in
delimiting the boundaries of exons and genes. 
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of 
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins categorized
with current classification methods?
(ii) What are the core functions that
appear to be common across the animals?
(iii) How does the human protein complement
differ from that of other sequenced
eukaryotes? 

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the puta- 
tive molecular functions of the predicted 
26,588 human proteins that have at least
two lines of supporting evidence. About
41% (12,809) of the gene products could
not be classified from this initial analysis
and are termed proteins with unknown 
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of 
"unclassified" sequences that do, in fact, 
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular func- 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins 
as possible. These functional predictions 
are based on similarity to sequences of 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third of these 636 predicted 
genes represented endogenous retroviral proteins,
further suggesting that the majority of



these unknown-function genes are not real 
genes. Given that most of these additional 
12,095 genes appear to be unique among the
genomes sequenced to date, many may simply
represent false-positive gene predictions. 

The most common molecular functions are 
the transcription factors and those involved in 
nucleic acid metabolism (nucleic acid enzyme). 
Other functions that are highly represented in 
the human genome are the receptors, kinases,
and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many 
proteins that are members of proto-oncogene 
families, as well as families of "select regula- 
tory molecules": (i) proteins involved in specif- 
ic steps of signal transduction such as hetero- 
trimeric GTP-binding proteins (G proteins) and 
cell cycle regulators, and (ii) proteins that mod- 
ulate the activity of kinases, G proteins, and 
phosphatases. 

Table 17. Distribution of SNPs in classes of 
genomic regions. 

Fig. 15. Distribution of the molecular
functions of 26383 human genes. Each
slice lists the numbers and percentages
(in parentheses) of human gene functions
assigned to a given category of molecular
function. The outer circle shows the
assignment to molecular function categories
in the Gene Ontology (GO) (179), and the
inner circle shows the assignment to
Celera's Panther molecular function
categories (116).

7.2 Evolutionary conservation of core
processes 

Because of the various "model organism" 
genome-sequencing projects that have al- 
ready been completed, reasonable compara- 
tive information is available for beginning the 
analysis of the evolution of the human ge-
nome. The genomes of S. cerevisiae ("bak-
ers' yeast") (118) and two diverse inverte-
brates, C. elegans (a nematode worm) (119)
and D. melanogaster (fly) (26), as well as the 
first plant genome, A. thaliana, recently com- 
pleted (92), provide a diverse background for 
genome comparisons. 

We enumerated the "strict orthologs" con- 
served between human and fly, and between 
human and worm (Fig. 16) to address the 
question, What are the core functions that 
appear to be common across the animals? 
The concept of orthology is important be- 
cause if two genes are orthologs, they can be 
traced by descent to the common ancestor of 
the two organisms (an "evolutionarily con- 
served protein set"), and therefore are likely 
to perform similar conserved functions in the 
different organisms. It is critical in this anal- 
ysis to separate orthologs (a gene that appears 
in two organisms by descent from a common 
ancestor) from paralogs (a gene that appears 
in more than one copy in a given organism by
a duplication event) because paralogs may
subsequently diverge in function. Following 
the yeast-worm ortholog comparison in 
(120), we identified two different cases for
each pairwise comparison (human-fly and 
human-worm). The first case was a pair of 
genes, one from each organism, for which 
there was no other close homolog in either 
organism. These are straightforwardly identi- 
fied as orthologous, because there are no 
additional members of the families that com-
plicate separating orthologs from paralogs.
The second case is a family of genes with 
more than one member in either or both of the 
organisms being compared. Chervitz et al.
(120) dealt with this case by analyzing a
phylogenetic tree that described the relation-
ships between all of the sequences in both
organisms and then looking for pairs of genes
that were nearest neighbors in the tree. If the
nearest-neighbor pairs were from different 
organisms, those genes were presumed to be 
orthologs. We note that these nearest neigh- 
bors can often be confidently identified from 
pairwise sequence comparison without hav- 
ing to examine a phylogenetic tree (see leg- 
end to Fig. 16). If the nearest neighbors are 
not from different organisms, there has been 
a paralogous expansion in one or both organ- 
isms after the speciation event (and/or a gene 
loss by one organism). When this one-to-one 
correspondence is lost, defining an ortholog 
becomes ambiguous. For our initial compu- 
tational overview of the predicted human pro- 
tein set, we could not answer this question for 
every predicted protein. Therefore, we con- 



sider only "strict orthologs," i.e., the proteins 
with unambiguous one-to-one relationships 
(Fig. 16). By these criteria, there are 2758
strict human-fly orthologs and 2031 strict
human-worm orthologs (1523 in common
between these sets). We define the
evolutionarily conserved set as those 1523
human proteins that have strict orthologs in
both D. melanogaster and C. elegans.
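
The selection of strict orthologs described above can be sketched in Python. This is a minimal illustration, not the pipeline used for the analysis: it assumes BLASTP results are already available as (query, subject, P-value) tuples, it uses the P-value as a stand-in for the BLASTP score, and the function names and toy identifiers (hA, fA, and so on) are hypothetical.

```python
def best_hits(hits):
    """Return {query: (subject, p_value)}, keeping the lowest P-value per query."""
    best = {}
    for query, subject, p_value in hits:
        if query == subject:
            continue  # ignore self-matches in within-genome searches
        if query not in best or p_value < best[query][1]:
            best[query] = (subject, p_value)
    return best


def strict_orthologs(a_vs_b, b_vs_a, a_vs_a, b_vs_b, max_p=1e-10):
    """Bi-directional best hits with no equally good within-genome paralog."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    paralog_a, paralog_b = best_hits(a_vs_a), best_hits(b_vs_b)
    pairs = set()
    for a, (b, p_value) in forward.items():
        if p_value >= max_p:
            continue  # the pair must pass the P-value cutoff
        if reverse.get(b, (None, None))[0] != a:
            continue  # the best hit must be reciprocal
        # Reject the pair if either gene has a paralog that scores as well.
        if a in paralog_a and paralog_a[a][1] <= p_value:
            continue
        if b in paralog_b and paralog_b[b][1] <= p_value:
            continue
        pairs.add((a, b))
    return pairs


def conserved_set(human_fly_pairs, human_worm_pairs):
    """Human proteins with a strict ortholog in both other genomes."""
    return {h for h, _ in human_fly_pairs} & {h for h, _ in human_worm_pairs}


# Toy data (hypothetical identifiers and P-values).
human_vs_fly = [("hA", "fA", 1e-50), ("hA", "fB", 1e-20),
                ("hB", "fB", 1e-30), ("hC", "fC", 1e-5)]
fly_vs_human = [("fA", "hA", 1e-48), ("fB", "hB", 1e-30), ("fC", "hC", 1e-5)]
human_vs_human = [("hB", "hD", 1e-40)]   # hB has a closer human paralog
fly_vs_fly = []

human_fly = strict_orthologs(human_vs_fly, fly_vs_human,
                             human_vs_human, fly_vs_fly)
human_worm = {("hA", "wA"), ("hE", "wE")}   # pretend human-worm result
print(human_fly)
print(conserved_set(human_fly, human_worm))
```

In the toy data, hB is dropped because its human paralog is a better match than its fly partner, and hC is dropped because its hit does not pass the cutoff, which mirrors why the strict set is a lower bound on the number of orthologs.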

The distribution of the functions of the 
conserved protein set is shown in Fig. 16. 
Comparison with Fig. 15 shows that, not 
surprisingly, the set of conserved proteins is 
not distributed among molecular functions in
the same way as the whole human protein set. 
Compared with the whole human set (Fig. 
15), there are several categories that are over- 
represented in the conserved set by a factor of 
~2 or more. The first category is nucleic acid
enzymes, primarily the transcriptional ma- 
chinery (notably DNA/RNA methyltrans- 
ferases, DNA/RNA polymerases, helicases, 
DNA ligases, DNA- and RNA-processing 
factors, nucleases, and ribosomal proteins). 
The basic transcriptional and translational
machinery is well known to have been con- 
served over evolution, from bacteria through 
to the most complex eukaryotes. Many ribo- 
nucleoproteins involved in RNA splicing also 
appear to be conserved among the animals. 
Other enzyme types are also overrepresent-
ed (transferases, oxidoreductases, ligases,
lyases, and isomerases). Many of these en-
zymes are involved in intermediary metabo-
lism. The only exception is the hydrolase
category, which is not significantly overrep-
resented in the shared protein set. Proteases
form the largest part of this category, and
several large protease families have expanded
in each of these three organisms after their
divergence. The category of select regulatory
molecules is also overrepresented in the con-
served set. The major conserved families are
small guanosine triphosphatases (GTPases)
(especially the Ras-related superfamily, in-
cluding ADP ribosylation factor) and cell
cycle regulators (particularly the cullin fam-
ily, cyclin C family, and several cell division
protein kinases). The last two significantly
overrepresented categories are protein trans-
port and trafficking, and chaperones. The
most conserved groups in these categories are
proteins involved in coated vesicle-mediated
transport, and chaperones involved in protein
folding and heat-shock response [particularly
the DNAJ family, and heat-shock protein
60 (HSP60), HSP70, and HSP90 families].
These observations provide only a conserva-
tive estimate of the protein families in the
context of specific cellular processes that
were likely derived from the last common
ancestor of the human, fly, and worm. As
stated before, this analysis does not provide a
complete estimate of conservation across the
three animal genomes, as paralogous dupli-
cation makes the determination of true or-
thologs difficult within the members of con-
served protein families.

Fig. 16. Functions of putative
orthologs across vertebrate
and invertebrate genomes.
Each slice lists the number and
percentages (in parentheses)
of "strict orthologs" between
the human, fly, and worm ge-
nomes involved in a given cat-
egory of molecular function.
"Strict orthologs" are defined
here as bi-directional BLAST
best hits (180) such that each
orthologous pair (i) has a
BLASTP P-value of <10^-10
(120), and (ii) has a more sig-
nificant BLASTP score than
any paralogs in either organ-
ism, i.e., there has likely been
no duplication subsequent to
speciation that might make
the orthology ambiguous. This
measure is quite strict and is a
lower bound on the number of
orthologs. By these criteria,
there are 2758 strict human-
fly orthologs, and 2031 hu-
man-worm orthologs (1523 in
common between these sets).
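
The notion of a category being "overrepresented by a factor of ~2" in the conserved set relative to the whole human set can be made concrete with a short calculation. The sketch below is illustrative only: the per-category counts are invented placeholders (the text does not list them), and the function name is hypothetical.

```python
def overrepresentation(conserved_counts, whole_counts):
    """Ratio of each category's share in the conserved set to its share overall."""
    total_conserved = sum(conserved_counts.values())
    total_whole = sum(whole_counts.values())
    ratios = {}
    for category, n_whole in whole_counts.items():
        share_whole = n_whole / total_whole
        share_conserved = conserved_counts.get(category, 0) / total_conserved
        ratios[category] = share_conserved / share_whole
    return ratios


# Placeholder counts, NOT values from Fig. 15 or Fig. 16.
whole = {"nucleic acid enzyme": 2000, "hydrolase": 1000, "other": 7000}
conserved = {"nucleic acid enzyme": 400, "hydrolase": 100, "other": 500}

for category, ratio in overrepresentation(conserved, whole).items():
    print(f"{category}: {ratio:.2f}")
```

A ratio near 1 (as for the placeholder hydrolase row) means the category is no more common in the conserved set than in the whole protein set; a ratio of 2 or more corresponds to the overrepresented categories discussed above.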

7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 
To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and 
19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/ 
domain families (defined by sequence simi- 
larity, e.g., the serine-threonine protein ki- 
nases) and superfamilies (defined by shared 
molecular function, which may include sev- 
eral sequence-related families, e.g., the cyto- 
kines). In these tables we have focused on 
(super) families that are either very large or 
that differ significantly in humans compared 
with the other sequenced eukaryote genomes. 
We have found that the most prominent hu-
man expansions are in proteins involved in (i)
acquired immune functions; (ii) neural devel-
opment, structure, and functions; (iii) inter-
cellular and intracellular signaling pathways
in development and homeostasis; (iv) hemo-
stasis; and (v) apoptosis.

Acquired immunity. One of the most
striking differences between the human ge-
nome and the Drosophila or C. elegans ge-
nome is the appearance of genes involved in
acquired immunity (Tables 18 and 19). This
is expected, because the acquired immune
response is a defense system that only occurs
in vertebrates. We observe 22 class I and 22
class II major histocompatibility complex
(MHC) antigen genes and 114 other immu-
noglobulin genes in the human genome. In
addition, there are 59 genes in the cognate 
immunoglobulin receptor family. At the do- 
main level, this is exemplified by an expan- 
sion and recruitment of the ancient immuno- 
globulin fold to constitute molecules such as 
MHC, and of the integrin fold to form several 
of the cell adhesion molecules that mediate 
interactions between immune effector cells 
and the extracellular matrix. Vertebrate-spe- 
cific proteins include the paracrine immune 
regulators family of secreted 4-alpha helical
bundle proteins, namely the cytokines and 
chemokines. Some of the cytoplasmic signal 
transduction components associated with cy- 
tokine receptor signal transduction are also 
features that are poorly represented in the fly 
and worm. These include protein domains 
found in the signal transducer and activator of 
transcription (STATs), the suppressors of cy- 
tokine signaling (SOCS), and protein inhibi- 
tors of activated STATs (PIAS). In contrast, 
many of the animal-specific protein domains 
that play a role in innate immune response, 
such as the Toll receptors, do not appear to be 
significantly expanded in the human genome. 

Neural development, structure, and
function. In the human genome, as compared
with the worm and fly genomes, there is a 
marked increase in the number of members 
of protein families that are involved in 
neural development. Examples include neu- 
rotrophic factors such as ependymin, nerve 
growth factor, and signaling molecules 
such as semaphorins, as well as the number 
of proteins involved directly in neural 
structure and function such as myelin pro- 
teins, voltage-gated ion channels, and syn- 
aptic proteins such as synaptotagmin. 
These observations correlate well with the 
known phenotypic differences between the 
nervous systems of these taxa, notably (i) 
the increase in the number and connectivity 
of neurons; (ii) the increase in number of 
distinct neural cell types (as many as a 
thousand or more in human compared with 
a few hundred in fly and worm) (121); (iii) 
the increased length of individual axons; 
and (iv) the significant increase in glial cell 
number, especially the appearance of my- 
elinating glial cells, which are electrically 
inert supporting cells differentiated from 
the same stem cells as neurons. A number 



of prominent protein expansions are in- 
volved in the processes of neural develop- 
ment. Of the extracellular domains that me- 
diate cell adhesion, the connexin domain- 
containing proteins (122) exist only in hu- 
mans. These proteins, which are not present 
in the Drosophila or C. elegans genomes, 
appear to provide the constitutive subunits 
of intercellular channels and the structural 
basis for electrical coupling. Pathway find-
ing by axons and neuronal network forma-
tion is mediated through a subset of ephrins
and their cognate receptor tyrosine kinases 
that act as positional labels to establish 
topographical projections (123). The prob- 
able biological role for the semaphorins (22 
in human compared with 6 in the fly and 2 
in the worm) and their receptors (neuropi- 
lins and plexins) is that of axonal guidance 
molecules (124). Signaling molecules such 
as neurotrophic factors and some cytokines 
have been shown to regulate neuronal cell 
survival, proliferation, and axon guidance 
(125). Notch receptors and ligands play 
important roles in glial cell fate determina- 
tion and gliogenesis (126). 

Other human expanded gene families play 
key roles directly in neural structure and 
function. One example is synaptotagmin (ex- 
panded more than twofold in humans relative 
to the invertebrates), originally found to reg- 
ulate synaptic transmission by serving as a 
Ca2+ sensor (or receptor) during synaptic
vesicle fusion and release (127). Of interest is 
the increased co-occurrence in humans of 
PDZ and SH3 domains in neuronal-specific
adaptor molecules; examples include
proteins that likely modulate channel activity 
at synaptic junctions (128). We also noted 
expansions in several ion-channel families 
(Table 19), including the EAG subfamily 
(related to cyclic nucleotide gated channels), 
the voltage-gated calcium/sodium channel 
family, the inward-rectifier potassium chan-
nel family, and the voltage-gated potassium
channel, alpha subunit family. Voltage-gated 
sodium and potassium channels are involved 
in the generation of action potentials in neu- 
rons. Together with voltage-gated calcium 
channels, they also play a key role in cou- 
pling action potentials to neurotransmitter re- 
lease, in the development of neurites, and in 
short-term memory. The recent observation 
of a calcium-regulated association between 
sodium channels and synaptotagmin may 
have consequences for the establishment and 
regulation of neuronal excitability (129). 

Myelin basic protein and myelin-associat- 
ed glycoprotein are major classes of protein 
components in both the central and peripheral 
nervous system of vertebrates. Myelin P0 is a 
major component of peripheral myelin, and 
myelin proteolipid and myelin oligodendro-
cyte glycoprotein are found in the central
nervous system. Mutations in any of these 

Table 18. Domain-based comparative analysis of proteins in H. sapiens (H),
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The
predicted protein set of each of the above eukaryotic organisms was analyzed
with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins
containing the specified Pfam domains as well as the total number of domains
(in parentheses) are shown in each column. Domains were categorized into
cellular processes for presentation. Some domains (i.e., SH2) are listed in
[illegible] of the Pfam [illegible] may differ from [illegible] of protein
families, owing to the limitations of large-scale automatic classifications.
Representative examples of domains with reduced counts owing to the stringent
E value cutoff used for this analysis are marked with a double asterisk (**).
Examples include short emergent and predominantly alpha-helical domains, and
[illegible] of cysteine-rich zinc finger proteins.



Accession 
number 



Domain name 



Domain description 



H 



W 



PF02039 

PF00212 

PF00028 

PF00214 

PF01110 

PF01093 • 

PF00029 

PF00976 

PF00473 

PF00007 

PF00778 

PF00322 

PF00812 

PF01404 

PF00167 

PF01534 

PF00236 

PF01153 

PF01271 

PF02058 

PF00049 

PF00219 

PF02024 

PF00193 

PF00243 

PF02158 

PF00184 

PF02070 

PF00066 

PF00865 

PF00159 

PF01279 

PF00123 

PF00341 

PF01403 

PF01033 

PF00103 

PF02208 

PF02404 

PF01034 

PF00020 

PF00019 

PF01099 

PF01160 

PF00110 

PF01821 
PF00386 
PF00200 
PF00754 
PF01410 
PF00039 
PF00040 
PF00051 
PF01823 
PF00354 
PF00277 
PF00084 
PF02210 
PF01108 
PF00868 
PF00927 



Adrenomedullin 
ANP 

Cadherin 
Calc_CGRP_IAPP
CNTF
Clusterin
Connexin
ACTH_domain
CRF
Cys_knot
DIX 

Endothelin 

Ephrin 

EPH_lbd

FGF 

Frizzled 

Hormone6 

Glypican

Granin 

Guanylin 

Insulin 

IGFBP 

Leptin 

Xlink 

NGF 

Neuregulin 
Hormone5
NMU 
Notch 

Osteopontin 

Hormone3 

Parathyroid 

Hormone2 

PDGF 

Sema 

Somatomedin_B

Hormone 

Sorb 

SCF 

Syndecan 

TNFR_c6 

TGF-beta

Uteroglobin 

Opiods_neuropep 

Wnt 

ANATO 
C1q 

Disintegrin 

F5_F8_type_C 

COLFI

Fn1

Fn2 

Kringle 

MACPF 

Pentaxin 

SAA_proteins

Sushi 

TSPN 

Tissue_fac

Transglutamin_N 

Transglutamin_C 



Developmental and homeostatic

Adrenomedullin

Atrial natriuretic peptide 
Cadherin domain 
Calcitonin/CGRP/IAPP family 
Ciliary neurotrophic factor 
Clusterin 
Connexin 

Corticotropin ACTH domain 

Corticotropin-releasing factor family 

Cystine-knot domain 

Dix domain 

Endothelin family 

Ephrin 

Ephrin receptor ligand binding domain 
Fibroblast growth factor 
Frizzled/Smoothened family membrane region 
Glycoprotein hormones 
Glypican 

Granin (chromogranin or secretogranin)

Guanylin precursor 

Insulin/IGF/Relaxin family 

Insulin-like growth factor binding proteins 

Leptin 

LINK (hyaluron binding) 
Nerve growth factor family 
Neuregulin family 
Neurohypophysial hormones 
Neuromedin U 
Notch (DSL) domain 
Osteopontin 

Pancreatic hormone peptides 
Parathyroid hormone family 
Peptide hormone 

Platelet-derived growth factor (PDGF) 
Sema domain 
Somatomedin B domain 
Somatotropin 

Sorbin homologous domain 

Stem cell factor 

Syndecan domain 

TNFR/NGFR cysteine-rich region 

Transforming growth factor beta-like domain

Uteroglobin family 

Vertebrate endogenous opioids neuropeptide 
Wnt family of developmental signaling proteins 

Hemostasis

Anaphylotoxin-like domain

C1q domain

Disintegrin 

F5/8 type C domain 

Fibrillar collagen C-terminal domain 

Fibronectin type I domain

Fibronectin type II domain 

Kringle domain 

MAC/Perforin domain 

Pentaxin family 

Serum amyloid A protein 

Sushi domain (SCR repeat) 

Thrombospondin N-terminal-like domains

Tissue factor 

Transglutaminase family 

Transglutaminase family 



regulators 

1 

2 

100(550) 
3 
1 
3 

14(16)
1 
2 

10(11) 
5 
3 

7(8) 
12 
23 
9 
1 
14 
3 
1 
7 
10 
1 

13(23) 
3 
4 
1 

3(5) 
1 
3 

5(9) 
5 

27(29) 
5(8) 
1 
2 
2 

17(31) 
27(28) 
3 
3 
18 

6(14) 
24 
18 
15(20) 
10
5(18)
11(16) 
15(24) 
6 
9 
4 

53(191) 
14 
1 
6 
8 



0 
0 

14(157) 
0 
0 

0
0

0 

1 

2 
2 
0 
2 
2 
1 
7 
0 
2 
0 
0 
4 
0 
0 
0 
0 
0 
0 
0 

2(4) 
0 
0 
0 
0 

1 

8(10) 
3 
0 
0 
0 
1 
1 
6 
0 
0 

7(10) 

0 
0 

5(6) 
0 
0 
0 
2 
0 
0 
0 

11(42) 
1 
0 
1 
1 



0 
0 

16(66) 
0 
0 
0 
0 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 

1 

0 
0 
0 
0 

2(6) 
0 
0 
0 
0 

0 . 
3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 

0 
0 
3 
2 

0 ' 

0 

0 

2 

0 

0 

0 

8(45) 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0' 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
. 0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
. 0 

b 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

o 

0 
0 
0 
0 
0 
0 
0 
0 
0 
Accession
number 



Domain name 
Gla 



Domain description 



W 



PF00594



PF00711 
PF00748 
PF00666 
PF00129

PF00993 
PF00969 
PF00879 
PF01109 
PF00047 
PF00143
PF00714 
PF00726 
PF02372 
PF00715 
PF00727 
PF02025 
PF01415 
PF00340 
PF02394 
PF02059 
PF00489 
PF01291 

PF00323 
PF01091 
PF00277 
PF00048 

PF01582
PF00229
PF00088 

PF00779 
PF00168 
PF00609 
PF00781 
PF00610 

PF01363 
PF00996 
PF00503
PF00631 
PF00616 
PF00618 

PF00625 
PF02189 
PF00169 
PF00130 

PF00388 

PF00387 

PF00640 
PF02192 
PF00794 
PF01412 
PF02196 
PF02145 
PF00788 
PF00071 
PF00617
PF00615
PF02197



Defensin_beta 
Calpain_inhib
Cathelicidins
MHC_I

MHC_II_alpha**
MHC_II_beta**
Defensin_propep

gm_csf

Interferon 

IFN-gamma 

IL10 

IL15 

IL2 

IL4 

IL5 

IL7 

IL1 

IL1_propep 

IL3 

IL6 

LIF_OSM 

Defensins 
PTN_MK
SAA_proteins
IL8 

TIR 
TNF 
Trefoil 

BTK 
C2 

DAGKa 
DAGKc
DEP 

FYVE 
GDI 

G-alpha
G-gamma 
RasGAP 
RasGEFN 

Guanylate_kin

ITAM 

PH 

DAG_PE-bind
PI-PLC-X 
PI-PLC-Y 
PID 

PI3K_p85B 
PI3K_rbd 
ArfGAP 
RBD 

Rap_GAP

RA 

Ras 

RasGEF 

RGS 

RIIa



Vitamin K-dependent carboxylation/gamma- 
carboxyglutamic (GLA) domain 

Immune response 

Beta defensin 

Calpain inhibitor repeat

Cathelicidins

Class I histocompatibility antigen, domains alpha 1 and 2

Class II histocompatibility antigen, alpha domain
Class II histocompatibility antigen, beta domain 
Defensin propeptide 

Granulocyte-macrophage colony-stimulating factor 

Immunoglobulin domain 

Interferon alpha/beta domain 

Interferon gamma 

Interleukin-10

Interleukin-15

Interleukin-2

Interleukin-4

Interleukin-5

Interleukin-7/9 family

Interleukin-1

Interleukin-1 propeptide

Interleukin-3

Interleukin-6/G-CSF/MGF family

Leukemia inhibitory factor (LIF)/oncostatin (OSM) 

family 
Mammalian defensin 
PTN/MK heparin-binding protein 
Serum amyloid A protein 
Small cytokines (intercrine/chemokine),

interleukin-8 like 
TIR domain 

TNF (tumor necrosis factor) family
Trefoil (P-type) domain 

PI-PY-rho GTPase signaling

BTK motif 
C2 domain 

Diacylglycerol kinase accessory domain (presumed) 
Diacylglycerol kinase catalytic domain (presumed) 
Domain found in Dishevelled, Egl-10, and 

Pleckstrin (DEP) 
FYVE zinc finger 
GDP dissociation inhibitor 
G-protein alpha subunit
G-protein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Ras-like 

GTPases; N-terminal motif 
Guanylate kinase 

Immunoreceptor tyrosine-based activation motif 
PH domain 

Phorbol esters/diacylglycerol binding domain (C1 
domain) 

Phosphatidylinositol-specific phospholipase C, X 
domain 

Phosphatidylinositol-specific phospholipase C, Y 
domain 

Phosphotyrosine interaction domain (PTB/PID) 
PI3-kinase family, p85-binding domain 
PI3-kinase family, ras-binding domain 
Putative GTP-ase activating protein for Arf 
Raf-like Ras-binding domain 
Rap/ran-GAP 

Ras association (RalGDS/AF-6) domain 
Ras family 
RasGEF domain 

Regulator of G protein signaling domain 
Regulatory subunit of type II PKA R-subunit 



3(9)
2 

18(20) 

5(6) 
7 
3 
1 

381 (930)
7(9) 
1 
1 
1 
1 
1 
1 
1 
7 
1 
1 
2 
2 



0 
0 
0 

0

0 
0 
0 
0 

125(291) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 

. :° 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 

/ 0 

. 0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 

: 0 
0. 

0 

0 

0 

0 

0 

0 

0. 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



2 


0 


0 


0 


0 


2 


0 


0 


0 


0 


4 


0 


0 


0 


0 


32 


0 


0 


0 


0 


18 


8 


2 


b 


131(143) 


12 


0 


0 


b 


0 


5(6) 


0 


2 


0 


0 


5 


1 


0 


0 


0 


73 (101) 


32 (44) 


24(35) 


6(9) 


66 (90) 


9 


4 


7 


0 


6 


10 


8 


8 


2 


11(12) 


12(13) 


4 


10 


5 


2 


28 (30) 


14 


15 


5 


15 


6 


2 


1 


1 


3 


27(30) 


10 


20(23) 


2 


5 


16 


5 


5 


1 


0 


11 


5 


8 


3 


0 


9 


2 


3 


5 


0 


12 


8 


7 


1 


4 


3 


0 


0 


0 


0 


193 (212) 


72(78) 


65 (68) 


24 


23 


45(56) 


25(31) 


26 (40) 


1(2) 


4 


12 


3 


7 


1 


8 


11 


2 


7 


1 


8 


24(27) 


13 


11(12) 


0 


0 


2 


1 


1 


0 


0 


6 


3 


1 


0 


0 


16 


9 


8 


6 


15 


6(7) 


4 


1 


0 


0 


5 


4 


2 


0 


0 


18(19) 


7(9) 


6 


1 


0 


126 


56(57) 


51 


23 


78 


21 


8 


7 


5 


0 


27 


6(7) 


12(13) 


1 


0 


4 


1 


2 


1 


0 
Accession
number 



Domain name 



Domain description 



H 



W 



PF00620 


RhoGAP 


PF00621 


RhoGEF


PF00536 


SAM 


PF01369 


Sec7 


PF00017 


SH2 


PF00018 


SH3 


PF01017 


STAT 


PF00790


VHS 


PF00568 


WH1 


PF00452 


Bcl-2


PF02180 


BH4


PF00619 


CARD 


PF00531 


Death 


PF01335 


DED 


PF02179 


BAG 


PF00656 


ICE_p20 


PF00653 


BIR 


PF00022 


Actin 


PF00191 


Annexin 




Calponin


PF00373 


Band_41 


PF00880 


Nebulin_repeat 


PF00681 


Plectin_repeat 


PF00435 


Spectrin 


PF00418 


Tubulin-binding 


PF00992 


Troponin 


PF02209 


VHP


PF01044 


Vinculin 


PF01391 


Collagen 


PF01413 


C4 


PF00431 


CUB 


PF00008 


EGF 


PF00147 


Fibrinogen_C 



PF00041 

PF00757 

PF00357 

PF00362 

PF00052 

PF00053 

PF00054 

PF00055

PF00059 

PF01463 

PF01462 

PF00057 

PF00058 

PF00530

PF00084 

PF00090 

PF00092 

PF00093 

PF00094 

PF00244

PF00023 

PF00514 

PF00168 

PF00027 

PF01556 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



Fn3 

Furin-like 

Integrin_A

Integrin_B

Laminin_B 

Laminin_EGF

Laminin_G 

Laminin_Nterm 

Lectin_c 

LRRCT 

LRRNT 

Ldl_recept_a

Ldl_recept_b

SRCR 

Sushi 

Tsp_1 

Vwa 

Vwc 

Vwd 

14-3-3 
Ank 

Armadillo_seg 
C2 

cNMP_binding

DnaJ_C 

DnaJ 

Efhand** 

FCH 

FF 

FHA 



RhoGAP domain 
RhoGEF domain 

SAM domain (Sterile alpha motif) 

Sec7 domain
Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WH1 domain 

Domains involved in apoptosis 
Bcl-2

Bcl-2 homology region 4 
Caspase recruitment domain 
Death domain 
Death effector domain 
Domain present in Hsp70 regulators 
ICE-like protease (caspase) p20 domain 
Inhibitor of Apoptosis domain 

Cytoskeletal

Actin 
Annexin 
Calponin family 

FERM domain (Band 4.1 family) 

Nebulin repeat
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

ECM adhesion
Collagen triple helix repeat (20 copies) 
C-terminal tandem repeated domain in type 4 

procollagen 
CUB domain 
EGF-like domain 

Fibrinogen beta and gamma chains, C-terminal 

globular domain 
Fibronectin type III domain 
Furin-like cysteine rich region 
Integrin alpha cytoplasmic region 
Integrins, beta chain 
Laminin B (Domain IV) 
Laminin EGF-like (Domains III and V)
Laminin G domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D domain 

Protein interaction domains 

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain

Cyclic nucleotide-binding domain
DnaJ C terminal region 
DnaJ domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 



59 
46 
29(31) 
13 

87(95)
143 (182) 
7 
4 
7 

9 
3 
16 
16 
4(5) 
5(8) 
11 
8(14) 

61(64) 
16(55) 
13(22) 
29 (30) 
4(148) 
2(11) 
31 (195) 
4(12) 
4 
5 
4 

65(279) 
6(11) 

47 (69) 
108 (420) 
26 

106(545) 
5 
3 
8 

8(12) 
24(126) 
30(57) 
10 

47(76) 
69(81) 
40(44) 
35(127) 
15(96) 
11(46) 
53 (191) 
41 (66) 
34(58) 
19(28) 
15(35) 



20 

145 (404) 
22(56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



19 
23(24) 
15 
5 

33(39) 
55(75) 
1 
2 
2 

2 
0 
0 
5 
0 

3 . 
5(9) 

15(16) 
4(16) 
3 

17(19) 

1(2) 
0 

13(171) 
1(4) 
6 
2 
2 

10(46)
2(4) 

9(47) 
45(186) 
10(11) 

42(168) 
2 
1 
2 

4(7) 
9(62) 
18(42) 
6 

23(24) 
23 (30) 
7(13) 
33(152) 
9(56) 
4(8) 
11(42) 
11(23) 
0 

6(H) 
3(7) 

3 

72 (269) 
11(38) 
32 (44) 
21 (33) 
9 
34 

64(117) 
3 

4(10) 
15 



20 
18(19) 
8 
5 

44(48) 
46(61) 

1(2) 
4 

2(3) 

1 
1 
2 
7 
0 
2 
3 

2(3) 



9 
3 
3 
5 

23(27) 
0 
4 
1 

0 
0 
0 
0 
0 
1 
0 

1(2) 



8 
0 
6 
9 
3 
'4 
0 
8 
0 

0 
.0 
0 
0 
0 
5 
0 
0 



12 


9(11) 


24 


4(11) 


0 


6 (16) 


7(19) 


o 


o 


11(14) 


o 


o 


1 


0 


o 


0 


0 


o 


10(93) 


0 


0 


2(8) 


0 


0 


8 


0 


0 


2 


0 


5 


1. 


0 


0 


174(384) 


. 0 


0 


3(6) 


0 


0 


43(67) 


0 


0 


54(157) 


0 


1 


6 


0 


0 


34(156) 


0 


1 


1 


0 


0 


2 


0 


0 


2 


0 


0 


6(10) 


0 


0 


11(65) 


0 


0 


14(26) 


0 


0 


4 


0 


0 


91 (132) 


0 


0 


7(9) 


0 


0 


3(6) 


0 


0 


27(113) 


0 


0 


7(22) 


0 


0 


1(2) 


0 


0 


8(45) 


0 


0 


18(47) 


0 


0 


17(19) 


0 


1 


2(5) 


0 


0 


9 


0 


0 



3 


2 


15 


75(223) 


12(20) 


66(111) 


3(11) 


2(10) 


25(67) 


24(35) 


6(9) 


66 (90) 


15(20) 


2(3) 


22 


5 


3 


19 


33 


20 


93 


41 (86) 


4(11) 


120 (328) 




4 . 


0 


3(16) 


2(5) 


4(8) 


7 


13(14) 


17 

myelin proteins result in severe demyelina-
tion, which is a pathological condition in
which the myelin is lost and the nerve con-
duction is severely impaired (130). Humans
have at least 10 genes belonging to four
different families involved in myelin produc-
tion (five myelin P0, three myelin proteolip-
id, myelin basic protein, and myelin-oligo-
dendrocyte glycoprotein, or MOG), and pos-
sibly more remotely related members of the
MOG family. Flies have only a single myelin
proteolipid, and worms have none at all.



Intercellular and intracellular signaling 
pathways in development and homeostasis. 
Many protein families that have expanded in 
humans relative to the invertebrates are in- 
volved in signaling processes, particularly in 
response to development and differentiation 

Accession
number



PF00254 
PF01590 
PF01344 
PF00560 
PF00917 
PF00989 
PF00595 
PF00169 
PF01535 
PF00536
PF01369 
PF00017 
PF00018 
PF01740 
PF00515 
PF00400 
PF00397 
PF00569 

PF01754 
PF01388 
PF01426 
PF00643 
PF00533 
PF00439
PF00651
PF00145
PF00385



PF00125 
PF00134 
PF00270
PF01529 
PF00646 
PF00250 
PF00320 
PF01585 
PF00010 
PF00850 
PF00046 
PF01833 
PF02373 
PF02375 
PF00013 
PF01352 
PF00104 

PF00412 
PF00917 
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076



PF02037 
PF00622
PF01852
PF00907 



Domain name



Domain description



FKBP FKBP-type peptidyl-prolyl cis-trans isomerases

GAF GAF domain

Kelch Kelch motif 

LRR** Leucine Rich Repeat 

MATH MATH domain 

PAS PAS domain 

PDZ PDZ domain (Also known as DHR or GLGF)

PH PH domain 

PPR** PPR repeat 

SAM SAM domain (Sterile alpha motif) 

Sec7 Sec7 domain 

SH2 Src homology 2 (SH2) domain 

SH3 Src homology 3 (SH3) domain 

STAS STAS domain 

TPR** TPR domain 

WD40** WD40 domain 

WW WW domain 

ZZ ZZ-Zinc finger present in dystrophin, CBP/p300 

Nuclear interaction domains 

Zf-A20 A20-like zinc finger

ARID ARID DNA binding domain 

BAH BAH domain 

Zf-B_box** B-box zinc finger

BRCT BRCA1 C Terminus (BRCT) domain 

Bromodomain Bromodomain 

BTB BTB/POZ domain 

DNA_methylase C-5 cytosine-specific DNA methylase 

Chromo 'chromo' (CHRromatin Organization MOdifier) domain

Histone Core histone H2A/H2B/H3/H4 

Cyclin Cyclin 

DEAD DEAD/DEAH box helicase 

Zf-DHHC DHHC zinc finger domain 

F-box** F-box domain 

Fork_head Fork head domain

GATA GATA zinc finger

G-patch G-patch domain

HLH** Helix-loop-helix DNA-binding domain 

Hist_deacetyl Histone deacetylase family 

Homeobox Homeobox domain 

TIG IPT/TIG domain 

JmjC JmjC domain 

JmjN JmjN domain 

KH-domain KH domain 

KRAB KRAB box 

Hormone_rec Ligand-binding domain of nuclear hormone
receptor

LIM LIM domain containing proteins

MATH MATH domain
Myb_DNA-binding Myb-like DNA-binding domain

Myc-LZ Myc leucine zipper domain 

Zf-MYND MYND finger 

PHD PHD-finger

Pou pou domain, N-terminal to homeobox domain
RFX_DNA_binding RFX DNA-binding domain

Rrm RNA recognition motif (a.k.a. RRM, RBD, or RNP
domain)

SAP SAP domain 

SPRY SPRY domain 

START START domain 

T-box T-box 
Table 18 (Continued)



The Human genome 



Accession  Domain name  Domain description                                          Human proteins (domains)

PF02135    Zf-TAZ       TAZ finger                                                  2 (3)
PF01285    TEA          TEA domain                                                  4
PF02176    Zf-TRAF      TRAF-type zinc finger                                       6 (9)
PF00352    TBP          Transcription factor TFIID (or TATA-binding protein, TBP)   2 (4)
PF00567    TUDOR        TUDOR domain                                                9 (24)
PF00642    Zf-CCCH      Zinc finger, C-x8-C-x5-C-x3-H type (and similar)            17 (22)
PF00096    Zf-C2H2**    Zinc finger, C2H2 type                                      564 (4500)
PF00097    Zf-C3HC4     Zinc finger, C3HC4 type (RING finger)                       135 (137)
PF00098    Zf-CCHC      Zinc knuckle                                                9 (17)



(Tables 18 and 19). They include secreted
hormones and growth factors, receptors, in- 
tracellular signaling molecules, and transcrip- 
tion factors. 

Developmental signaling molecules that are 
enriched in the human genome include growth 
factors such as wnt, transforming growth fac-
tor-β (TGF-β), fibroblast growth factor (FGF),
nerve growth factor, platelet derived growth 
factor (PDGF), and ephrins. These growth fac- 
tors affect tissue differentiation and a wide 
range of cellular processes involving actin-cy- 
toskeletal and nuclear regulation. The corre- 
sponding receptors of these developmental li- 
gands are also expanded in humans. For exam- 
ple, our analysis suggests at least 8 human 
ephrin genes (2 in the fly, 4 in the worm) and 12 
ephrin receptors (2 in the fly, 1 in the worm). In 
the wnt signaling pathway, we find 18 wnt 
family genes (6 in the fly, 5 in the worm) and 
12 frizzled receptors (6 in the fly, 5 in the 
worm). The Groucho family of transcriptional 
corepressors downstream in the wnt pathway 
are even more markedly expanded, with 13 
predicted members in humans (2 in the fly, 1 in 
the worm). 
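
The family counts quoted in this paragraph can be turned into fold expansions with a few lines of Python. This is only a minimal sketch: the dictionary name `FAMILY_COUNTS` and the helper functions are hypothetical, the numbers are copied from the text above, and the human ephrin count is a lower bound ("at least 8"), so its ratio is a lower bound as well.

```python
# Gene-family counts quoted in the text, stored as (human, fly, worm).
# The human ephrin figure is a lower bound ("at least 8").
FAMILY_COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled receptor": (12, 6, 5),
    "Groucho corepressor": (13, 2, 1),
}


def fold_expansion(human: int, other: int) -> float:
    """Ratio of the human gene count to the count in another genome."""
    if other <= 0:
        raise ValueError("comparison count must be positive")
    return human / other


def expansion_table(counts=FAMILY_COUNTS):
    """Map each family to its human/fly and human/worm fold expansion."""
    return {
        family: {
            "vs_fly": fold_expansion(human, fly),
            "vs_worm": fold_expansion(human, worm),
        }
        for family, (human, fly, worm) in counts.items()
    }


if __name__ == "__main__":
    for family, folds in expansion_table().items():
        print(f"{family:20s} {folds['vs_fly']:5.1f}x fly  {folds['vs_worm']:5.1f}x worm")
```

Read this way, the Groucho family stands out: 13 predicted human members against a single worm gene, compared with a threefold difference for the wnt genes relative to the fly.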

Extracellular adhesion molecules involved 
in signaling are expanded in the human genome 
(Tables 18 and 19). The interactions of several 
of these adhesion domains with extracellular 
matrix proteoglycans play a critical role in host 
defense, morphogenesis, and tissue repair 
(131). Consistent with the well-defined role of
heparan sulfate proteoglycans in modulating
these interactions (132), we observe an expan-
sion of the heparan sulfate sulfotransferases in
the human genome relative to worm and fly.
These sulfotransferases modulate tissue differ-
entiation (133). A similar expansion in humans
is noted in structural proteins that constitute the 
actin-cytoskeletal architecture. Compared with 
the fly and worm, we observe an explosive 
expansion of the nebulin (35 domains per pro- 
tein on average), aggrecan (12 domains per 
protein on average), and plectin (5 domains per 
protein on average) repeats in humans. These 
repeats are present in proteins involved in mod- 
ulating the actin-cytoskeleton with predominant 
expression in neuronal, muscle, and vascular 
tissues. 



Comparison across the five sequenced eu-
karyotic organisms revealed several expand- 
ed protein families and domains involved in 
cytoplasmic signal transduction (Table 18). 
In particular, signal transduction pathways 
playing roles in developmental regulation and 
acquired immunity were substantially en- 
riched. There is a factor of 2 or greater ex- 
pansion in humans in the Ras superfamily 
GTPases and the GTPase activator and GTP 
exchange factors associated with them. Al- 
though there are about the same number of 
tyrosine kinases in the human and C. elegans 
genomes, in humans there is an increase in 
the SH2, PTB, and ITAM domains involved 
in phosphotyrosine signal transduction. Fur- 
ther, there is a twofold expansion of phos- 
phodiesterases in the human genome com- 
pared with either the worm or fly genomes. 

The downstream effectors of the intracellu- 
lar signaling molecules include the transcription 
factors that transduce developmental fates. Sig- 
nificant expansions are noted in the ligand- 
binding nuclear hormone receptor class of tran- 
scription factors compared with the fly genome, 
although not to the extent observed in the worm 
(Tables 18 and 19). Perhaps the most striking 
expansion in humans is in the C2H2 zinc finger 
transcription factors. Pfam detects a total of 
4500 C2H2 zinc finger domains in 564 human 
proteins, compared with 771 in 234 fly proteins. 
This means that there has been a dramatic 
expansion not only in the number of C2H2 
transcription factors, but also in the number of 
these DNA-binding motifs per transcription 
factor (8 on average in humans, 3.3 on average 
in the fly, and 2.3 on average in the worm). 
Furthermore, many of these transcription fac- 
tors contain either the KRAB or SCAN do-
mains, which are not found in the fly or worm
genomes. These domains are involved in the 
oligomerization of transcription factors and in- 
crease the combinatorial partnering of these 
factors. In general, most of the transcription 
factor domains are shared between the three 
animal genomes, but the reassortment of these 
domains results in organism-specific transcrip- 
tion factor families. The domain combinations 
found in the human, fly, and worm include the 
BTB with C2H2 in the fly and humans, and 
homeodomains alone or in combination with
Pou and LIM domains in all of the animal 
genomes. In plants, however, a different set of 
transcription factors are expanded, namely, the 
myb family, and a unique set that includes VP1 
and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription 
factors compared with the multicellular eu- 
karyotes, and its repertoire is limited to the 
expansion of the yeast-specific C6 transcription 
factor family involved in metabolic regulation. 
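
The per-protein averages quoted earlier in this paragraph for the C2H2 zinc fingers follow directly from the Pfam counts given in the text. The sketch below reproduces that arithmetic; the names `C2H2_COUNTS` and `domains_per_protein` are hypothetical, and the worm is omitted because the text reports only its average (2.3), not the underlying counts.

```python
# (proteins, domains) for C2H2 zinc fingers, as quoted in the text.
C2H2_COUNTS = {
    "human": (564, 4500),
    "fly": (234, 771),
}


def domains_per_protein(proteins: int, domains: int) -> float:
    """Average number of domains carried by each domain-containing protein."""
    if proteins <= 0:
        raise ValueError("protein count must be positive")
    return domains / proteins


def summarize(counts=C2H2_COUNTS):
    """Return the average domains per protein for each organism."""
    return {
        organism: domains_per_protein(proteins, domains)
        for organism, (proteins, domains) in counts.items()
    }


if __name__ == "__main__":
    for organism, average in summarize().items():
        print(f"{organism}: {average:.1f} C2H2 domains per protein")
```

Rounded to one decimal place, the results match the averages stated in the text (8 for humans, 3.3 for the fly), which shows that the expansion affects both the number of proteins and the number of motifs per protein.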

While we have illustrated expansions in a 
subset of signal transduction molecules in the 
human genome compared with the other eu-
karyotic genomes, it should be noted that
most of the protein domains are highly con-
served. An interesting observation is that
worms and humans have approximately the 
same number of both tyrosine kinases and 
serine/threonine kinases (Table 19). It is im- 
portant to note, however, that these are mere- 
ly counts of the catalytic domain; the proteins 
that contain these domains also display a 
wide repertoire of interaction domains with 
significant combinatorial diversity. 

Hemostasis. Hemostasis is regulated pri- 
marily by plasma proteases of the coagulation 
pathway and by the interactions that occur be- 
tween the vascular endothelium and platelets. 
Consistent with known anatomical and physio- 
logical differences between vertebrates and in- 
vertebrates, extracellular adhesion domains that 
constitute proteins integral to hemostasis are 
expanded in the human relative to the fly and 
worm (Tables 18 and 19). We note the evolu-
tion of domains such as FIMAC, FN1, FN2,
and C1q that mediate surface interactions be-
tween hematopoietic cells and the vascular ma-
trix. In addition, there has been extensive re-
cruitment of more-ancient animal-specific do-
mains such as VWA, VWC, VWD, kringle, 
and FN3 into multidomain proteins that are 
involved in hemostatic regulation. Although we 
do not find a large expansion in the total num- 
ber of serine proteases, this enzymatic domain 
has been specifically recruited into several of 
these multidomain proteins for proteolytic reg-
ulation in the vascular compartment. These are
represented in plasma proteins that belong to 
the kinin and complement pathways. There is a 
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and 
metalloprotease) and MMPs (matrix metallo- 
proteases) (Table 19). Proteolysis of extracel- 
lular matrix (ECM) proteins is critical for tissue 
development and for tissue degradation in dis- 
eases such as cancer, arthritis, Alzheimer's dis-
ease, and a variety of inflammatory conditions
(135, 136). ADAMs are a family of integral
membrane proteins with a pivotal role in fibrin-
ogenolysis and modulating interactions be-
tween hematopoietic components and the
vascular matrix components. These proteins
have been shown to cleave matrix proteins,
and even signaling molecules: ADAM-17
converts tumor necrosis factor-α, and
ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified 
19 members of the matrix metalloprotease 
family, and a total of 51 members of the 
ADAM and ADAM-TS families. 

Apoptosis. Evolutionary conservation of 
some of the apoptotic pathway components 
across eukarya is consistent with its central 
role in developmental regulation and as a 
response to pathogens and stress signals. The 
signal transduction pathways involved in pro- 
grammed cell death, or apoptosis, are medi- 
ated by interactions between well-character- 
ized domains that include extracellular do- 
mains, adaptor (protein-protein interaction) 
domains, and those found in effector and 

regulatory enzymes (137). We enumerated
the protein counts of central adaptor and ef-
fector enzyme domains that are found only in
the apoptotic pathways to provide an estimate 
of divergence across eukarya and relative 
expansion in the human genome when com- 
pared with the fly and worm (Table 18). 
Adaptor domains found in proteins restricted 
only to apoptotic regulation such as the DED 
domains are vertebrate-specific, whereas oth- 
ers like BIR, CARD, and Bcl2 are represent- 
ed in the fly and worm (although the number 
of Bcl2 family members in humans is signif- 
icantly expanded). Although plants and yeast 
lack the caspases, caspase-like molecules, 
namely the para- and meta-caspases, have 
been reported in these organisms (138). Com- 
pared with other animal genomes, the human 
genome shows an expansion in the adaptor 
and effector domain-containing proteins in- 
volved in apoptosis, as well as in the pro- 
teases involved in the cascade such as the 
caspase and calpain families. 

Expansions of other protein families. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the lip-
oxygenase-activating proteins (four in humans)
may be vertebrate-specific. Lipoxygenases are
involved in arachidonic acid metabolism, and
they and their activators have been implicated
in diverse human pathology ranging from
allergic responses to cancers. One of the most
surprising human expansions, however, is in
the number of glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) genes (46 in hu-
mans, 3 in the fly, and 4 in the worm). There
is, however, evidence for many retrotrans-
posed GAPDH pseudogenes (139), which
may account for this apparent expansion.
However, it is interesting that GAPDH, long
known as a conserved enzyme involved in
basic metabolism found across all phyla from
bacteria to humans, has recently been shown
to have other functions. It has a second cat-
alytic activity, as a uracil DNA glycosylase
(140) and functions as a cell cycle regulator
(141) and has even been implicated in apo-
ptosis (142).

Cytokine receptor†
Bradykinin/C-C chemokine receptor
Fl cytokine receptor
Interferon receptor
Interleukin receptor
MCSF receptor
TNF receptor
Polymeric-immunoglobulin receptor

Translation. Another striking set of hu- 
man expansions has occurred in certain fam- 
ilies involved in the translational machinery. 
We identified 28 different ribosomal subunits 
that each have at least 10 copies in the ge-
nome; on average, for all ribosomal proteins 
there is about an 8- to 10-fold expansion in 
the number of genes relative to either the 
worm or fly. Retrotransposed pseudogenes
may account for many of these expansions
[see the discussion above and (143)]. Recent
evidence suggests that a number of ribosomal
proteins have secondary functions indepen-
dent of their involvement in protein biosyn-
thesis; for example, L13a and the related L7
subunits (36 copies in humans) have been
shown to induce apoptosis (144).

There is also a four- to fivefold expansion
in the elongation factor 1-alpha family
(eEF1A; 56 human genes). Many of these
expansions likely represent intronless para-
logs that have presumably arisen from retro-
transposition, and again there is evidence that
many of these may be pseudogenes (145).
However, a second form (eEF1A2) of this
factor has been identified with tissue-specific
expression in skeletal muscle and a comple-
mentary expression pattern to the ubiquitous-
ly expressed eEF1A (146).

Table 19 (Continued)

Signaling molecules†
Calcitonin
Ephrin
FGF
Glucagon
Glycoprotein hormone beta chain
Insulin
Insulin-like hormone
Nerve growth factor
Neuregulin/heregulin
Neuropeptide Y
PDGF
Relaxin
Stanniocalcin
Thymopoietin
Thymosin beta
TGF-β
VEGF
Wnt
Receptors†
Ephrin receptor
FGF receptor
Frizzled receptor
Parathyroid hormone receptor
VEGF receptor
BDNF/NT-3 nerve growth factor receptor
Dual-specificity protein phosphatase
S/T and dual-specificity protein kinase†
S/T protein phosphatase
Y protein kinase†
Y protein phosphatase
ARF family
Cyclic nucleotide phosphodiesterase
G protein-coupled receptors†‡
G-protein alpha
G-protein beta
G-protein gamma
Ras superfamily
G-protein modulators†
ARF GTPase-activating
Neurofibromin
Ras GTPase-activating
Tuberin
Vav proto-oncogene family

Developmental and homeostatic regulators

Ribonucleoproteins. Alternative splicing
results in multiple transcripts from a single
gene, and can therefore generate additional 
diversity in an organism's protein comple- 
ment. We have identified 269 genes for ri- 
bonucleoproteins. This represents over 2.5 
times the number of ribonucleoprotein genes 
in the worm, two times that of the fly, and
about the same as the 265 identified in the 
Arabidopsis genome. Whether the diversity 
of ribonucleoprotein genes in humans con- 
tributes to gene regulation at either the splic- 
ing or translational level is unknown. 

Posttranslational modifications. In this 
set of processes, the most prominent expan- 
sion is the transglutaminases, calcium-depen- 
dent enzymes that catalyze the cross-linking 
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K-dependent gamma carboxylase gene prod- 
uct acts on the GLA domain (missing in the 
fly and worm) found in coagulation factors, 
osteocalcin, and matrix GLA protein (148). 
Tyrosylprotein sulfotransferases participate
in the posttranslational modification of pro-
teins involved in inflammation and hemosta-
sis, including coagulation factors and chemo-
kine receptors (149). Although there is no 
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not found in the other currently se- 
quenced genomes. These include the tandem 
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains— BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the prominent differences in
the immune system, hemostasis, neuronal,
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive do- 
main shuffling to increase or alter combina- 
torial diversity can provide an exponential 
increase in the ability to mediate protein-
protein interactions without dramatically in- 
creasing the absolute size of the protein com- 
plement (150). Evolution of apparently new
(from the perspective of sequence analysis) 
protein domains and increasing regulatory 
complexity by domain accretion both quanti- 
tatively and qualitatively (recruitment of nov- 
el domains with preexisting ones) are two 
features that we observe in humans. Perhaps
the best illustration of this trend is the C2H2
zinc finger-containing transcription factors
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal ribosomal entry sites in the human
genome to regulate translation of specific
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome.
At the posttranslational level, although 
we provide examples of expansions of some 
protein families involved in these modifica- 
tions, further experimental evidence is re- 
quired to evaluate whether this is correlated 
with increased complexity in protein process-
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 
8 Conclusions

8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of 
genome sizes and repeat content allows us to 
assess its strengths and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the 
human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (/J, 80, 152) demonstrate that
megabase-sized genomes can be sequenced 
efficiently without any input other than the de
novo mate-paired sequences. With more 
complex genomes like those of Drosophila or
human, map information, in the form of well-
ordered markers, has been critical for long-
range ordering of scaffolds. For joining scaf-
folds into chromosomes, the quality of the
map (in terms of the order of the markers) is
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequenc-
ing, the prior existence of mapping data was
beneficial. During the sequencing of the A.
thaliana genome, sequencing of individual
BAC clones permitted extension of the se-
quence well into centromeric regions and al-
lowed high-quality resolution of complex re- 
peat regions. Likewise, in Drosophila, the 
BAC physical map was most useful in re- 
gions near the highly repetitive centromeres 
and telomeres. WGA has been found to de- 
liver excellent-quality reconstructions of the 
unique regions of the genome. As the genome 
size, and more importantly the repetitive con- 
tent, increases, the WGA approach delivers 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by- 
clone approaches makes them difficult to justify 
as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific applica-
tions of BAC-based or other clone mapping and
sequencing strategies to resolve ambiguities in 
sequence assembly that cannot be efficiently 
resolved with computational approaches alone 
are clearly worth exploring. Hybrid approaches 
to whole-genome sequencing will only work if 
there is sufficient coverage in both the whole- 
genome shotgun phase and the BAC clone se- 
quencing phase. Our experience with human
genome assembly suggests that this will require
at least 3X coverage of both whole-genome and
BAC shotgun sequence data. 

8.2 The low gene number in humans 

We have sequenced and assembled ~95% of
the euchromatic sequence of H. sapiens and
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the 
human genes. This has provided a major sur- 
prise: We have found far fewer genes (26,000 
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever 
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical 
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes 
that remain unpredicted owing to the absence 
of EST or protein data to support them, al- 
though our use of mouse genome data for 
predicting genes should limit this number. As
was true at the beginning of genome sequenc- 
ing, ultimately it will be necessary to measure 
mRNA in specific cell types to demonstrate 
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a 
population of organisms might have to pay a 
price for the number of genes it can possibly 
carry. He theorized that when the number of 
genes becomes too large, each zygote carries 
so many new deleterious mutations that the 
population simply cannot maintain itself. On
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced 
mutations at specific loci, Muller, in 1967 
(154), calculated that the mammalian ge- 
nome would contain a maximum of not much 
more than 30,000 genes (155). An estimate of 
30,000 gene loci for humans was also arrived 
at by Crow and Kimura (156). Muller's esti- 
mate for D. melanogaster was 10,000 genes, 
compared to 13,000 derived by annotation of 
the fly genome (26, 27). These arguments for 
the theoretical maximum gene number were 
based on simplified ideas of genetic load — 
that all genes have a certain low rate of 
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities 
inherent in human development and the so- 
phisticated signaling systems that maintain 
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as by the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth-
ylation of CpG islands in imprinting (158);
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon
of RNA editing in which coding changes
occur directly at the level of mRNA is of 
clinical and biological relevance (161). Final- 
ly, examples of translational control include 
internal ribosomal entry sites that are found 
in proteins involved in cell cycle regulation 
and apoptosis (162). At the protein level,
minor alterations in the nature of protein-
protein interactions, protein modifications, 
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human
genome is asymmetrically populated with 
G+C content, CpG islands, and genes (68). 
However, the genes are not distributed quite 
as unequally as had been predicted (Table 9) 
(69). The most G+C-rich fraction of the ge- 
nome, H3 isochores, constitute more of the 
genome than previously thought (about 9%), 
and are the most gene-dense fraction, but 
contain only 25% of the genes, rather than the 
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
cation, has been described as the "desertifi-
cation" of the vertebrate genome (71). Why
are there clustered regions of high and low 
gene density, and are these accidents of his- 
tory or driven by selection and evolution? If 
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of hu- 
mans; for example, Miniopterus, a species of 
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans.
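
The isochore figures quoted earlier in this paragraph can be expressed as a relative gene density, that is, the share of genes divided by the share of genome. The following is a rough sketch using only the percentages stated in the text; the names `ISOCHORES` and `relative_gene_density` are hypothetical, and a value of 1.0 would mean that genes are spread evenly.

```python
# (fraction of genome, fraction of genes) as quoted in the text.
ISOCHORES = {
    "H3": (0.09, 0.25),  # most G+C-rich fraction
    "L": (0.65, 0.48),   # low G+C fraction
}


def relative_gene_density(genome_fraction: float, gene_fraction: float) -> float:
    """Gene share divided by genome share; 1.0 means an even distribution."""
    if genome_fraction <= 0:
        raise ValueError("genome fraction must be positive")
    return gene_fraction / genome_fraction


def density_table(isochores=ISOCHORES):
    """Relative gene density for each isochore class."""
    return {
        name: relative_gene_density(genome, genes)
        for name, (genome, genes) in isochores.items()
    }


if __name__ == "__main__":
    for name, density in density_table().items():
        print(f"{name}: {density:.2f}x the genome-wide average gene density")
```

On these numbers the H3 isochores carry a little under three times the average gene density and the L isochores about three-quarters of it, so the genome is unevenly populated but less extremely than a 40% share of genes in H3 would have implied.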

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identi- 
fied and mapped more than 3 million SNPs, this 
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplo- 
types present in subjects of different ethnogeo-
graphic origins, providing insights into popula- 
tion history and migration patterns. Although 
such studies have suggested that modern human
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and admix-
ture, SNPs can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation
may prove to be especially informative to iden-
tify sites of reduced genetic diversity that may
mark loci where sequence variations are not 
tolerated. 

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism— sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo-
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph-
ila, and it remains an important task to
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
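
The "roughly one-quarter" figure for the Y chromosome mentioned earlier in this paragraph comes from simple copy counting. The sketch below makes that counting explicit under a loudly stated assumption: equal numbers of breeding males and females, with no other demographic effects. The function names are hypothetical and the model is deliberately simplified.

```python
def chromosome_copies(n_males: int, n_females: int) -> dict:
    """Count copies of one autosome, the X, and the Y in a population.

    Assumes every individual carries two copies of each autosome, males
    carry one X and one Y, and females carry two X chromosomes.
    """
    return {
        "autosome": 2 * (n_males + n_females),
        "X": n_males + 2 * n_females,
        "Y": n_males,
    }


def relative_to_autosome(n_males: int, n_females: int) -> dict:
    """Copy number of the X and Y relative to a single autosome."""
    copies = chromosome_copies(n_males, n_females)
    return {
        name: copies[name] / copies["autosome"]
        for name in ("X", "Y")
    }


if __name__ == "__main__":
    # Equal sex ratio is an assumption made for illustration only.
    print(relative_to_autosome(1000, 1000))
```

With an equal sex ratio the Y is present at one-quarter, and the X at three-quarters, of the autosomal copy number, which is consistent with the ordering of diversity described in the text (Y lowest, X intermediate, autosomes highest).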
8.4 Genome complexity

We will soon be in a position to move away 
from the cataloging of individual compo-
nents of the system, and beyond the sim- 
plistic notions of "this binds to that, which
then docks on this, and then the complex
moves there. . ." (167) to the exciting area
of network perturbations, nonlinear re- 
sponses and thresholds, and their pivotal 
role in human diseases. 

The enumeration of other "parts lists" re- 
veals that in organisms with complex nervous 
systems, neither gene number, neuron number, 
nor number of cell types correlates in any
meaningful manner with even simplistic mea-
sures of structural or behavioral complexity. 
Nor would they be expected to; this is the realm 
of nonlinearities and epigenesis (168). The 520 
million neurons of the common octopus exceed 
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and 
human, and from comparative mammalian neu- 
roanatomy (169), that the morphological and 
behavioral diversity found in mammals is un-
derpinned by a similar gene repertoire and sim-
ilar neuroanatomies. For example, when one
compares a pygmy marmoset (which is only 4 
inches tall and weighs about 6 ounces) to a 
chimpanzee, the brain volume of this minute 
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp 
and three orders less than that of humans. Yet 
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and
chimpanzees, the gene number, gene structures
and functions, chromosomal and genomic or-
ganizations, and cell types and neuroanatomies
are almost indistinguishable, yet the develop- 
mental modifications that predisposed human 
lineages to cortical expansion and development 
of the larynx, giving rise to language, culminat- 
ed in a massive singularity that by even the 
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, friz-
zled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
conclusion that Einstein's brain was more 
complex than that of Drosophila, closer com-
parisons such as whether the set of predicted
human proteins is more complex than the 
protein set of Drosophila, and if so, to what 
degree, are not straightforward, since protein, 
protein domain, or protein-protein interaction 
measures do not capture context-dependent
interactions that underpin the dynamics un-
derlying phenotype.

Currently, there are more than 30 different 
mathematical descriptions of complexity (170).
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal popula- 
tions), is through graph theory (171). The ele- 
ments of the system can be represented by the 
vertices of complex topographies, with the edg- 
es representing the interactions between them. 
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not
due to redundancy, but is a property of inho- 
mogeneously wired networks. The error toler- 
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious pheno-
typic effects (172), and yet the usually conspic-
uous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
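
The graph-theoretic picture above, in which an inhomogeneously wired network tolerates most node losses but is vulnerable to the removal of a few highly connected nodes, can be illustrated with a toy example. This is only a sketch under stated assumptions: the twelve-node network, its node names, and the use of largest-connected-component size as a proxy for network integrity are all hypothetical choices made for illustration and are not taken from the text or from any real interaction data set.

```python
from collections import deque


def largest_component(adjacency, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)          # removed nodes are treated as already visited
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:             # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def build_toy_network():
    """Two linked hubs, each with five peripheral nodes (hypothetical layout)."""
    edges = [("hubA", "hubB")]
    edges += [("hubA", f"a{i}") for i in range(5)]
    edges += [("hubB", f"b{i}") for i in range(5)]
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


network = build_toy_network()
intact = largest_component(network)                # all 12 nodes connected
minus_leaf = largest_component(network, {"a0"})    # peripheral "knockout"
minus_hub = largest_component(network, {"hubA"})   # hub "knockout"

if __name__ == "__main__":
    print(intact, minus_leaf, minus_hub)
```

In this toy layout, deleting a peripheral node leaves the rest of the network connected, whereas deleting one hub isolates every node that depended on it, loosely mirroring the contrast drawn above between knockouts with minor effects and knockouts of critical nodes.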

It has been predicted for the last 15 years 
that complete sequencing of the human ge-
nome would open up new strategies for hu-
man biological research and would have a
major impact on medicine, and through med- 
icine and public health, on society. Effects on 
biomedical research are already being felt.
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It 
has been possible only because of innova-
tions in instrumentation and software that
have allowed automation of almost every step 
of the process from DNA preparation to an-
notation. The next steps are clear: We must
define the complexity that ensues when this
relatively modest set of about 30,000 genes is
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- 
istry, physiology, and ultimately phenotype 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe-
notypic characteristics determined. Now we
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its
potential for improvement of personal health.
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc-
tionism, the view that with complete knowl-
edge of the human genome sequence, it is
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
share at least one significant BLAST hit in common.
This is an especially interesting property of the
metric, because it allows the rapid recovery of pro-
tein families from the proteome for which no mul-
tiple alignment is possible, thus providing a compu-
tational basis for the extension of protein homology
searches beyond those of current HMM- and profile- 
based search methods. Once the whole-proteome 
similarity matrix has been calculated, Lek first par-
titions the proteome into single-linkage clusters
(27) on the basis of one or more shared BLAST hits
between two sequences. Next these single-linkage 
clusters are further partitioned into subclusters,
each member of which shares a user-specified pair-
wise similarity with the other members of the clus-
ter, as described above. For the purposes of this
publication, we have focused on the analysis of 
single-linkage clusters and what we have termed
"complete clusters," i.e., those subclusters for
which every member has a similarity metric of 1 to
every other member of the subcluster. We believe
that the single-linkage and complete clusters are of
special interest, in part, because they allow us to
estimate and to compare sizes of core protein sets 
in a rigorous manner. The rationale for this is as 
follows: if one imagines for a moment a perfect 
clustering algorithm capable of perfectly partition- 
ing one or more perfectly annotated protein sets 
into protein families, it is reasonable to assume that 
the number of clusters will always be greater than, 
or equal to, the number of single-linkage clusters, 
because single-linkage clustering is a maximally ag-
glomerative clustering method. Thus, if there exists
a single protein in the predicted protein set contain-
ing domains A and B, then it will be clustered by
single linkage together with all single-domain pro- 
teins containing domains A or B. Likewise, for a 
predicted protein set containing a single multido- 
main protein, the number of real clusters must 
always be less than or equal to the number of
complete clusters, because it is impossible to place
a unique multidomain protein into a complete clus- 
ter. Thus, the single-linkage and complete clusters 
plus singletons should comprise a lower and upper 
bound of sizes of core protein sets, respectively, 
allowing us to compare the relative size and com- 
plexity of different organisms' predicted protein set 
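
The two kinds of cluster discussed above can be sketched in a few lines. This is a minimal illustration and not the Lek implementation: the protein names are hypothetical, a "shared BLAST hit" is reduced to a pair appearing in a list, and a similarity metric of 1 is approximated by requiring that every pair in a cluster appears in that same list.

```python
from itertools import combinations


def single_linkage(items, linked_pairs):
    """Group items so that any chain of linked pairs ends up in one cluster."""
    parent = {x: x for x in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


def is_complete(cluster, linked_pairs):
    """True if every member of the cluster is linked to every other member."""
    linked = {frozenset(p) for p in linked_pairs}
    return all(frozenset(p) in linked for p in combinations(cluster, 2))


# Hypothetical proteins: P_A has domain A, P_B has domain B, P_AB has both,
# and P_C is unrelated to the others.
proteins = ["P_A", "P_B", "P_AB", "P_C"]
hits = [("P_A", "P_AB"), ("P_B", "P_AB")]

clusters = single_linkage(proteins, hits)

if __name__ == "__main__":
    print(clusters)
    print(is_complete(["P_A", "P_AB", "P_B"], hits))
    print(is_complete(["P_A", "P_AB"], hits))
```

Here the multidomain protein pulls the two single-domain proteins into one single-linkage cluster, yet that cluster is not complete because P_A and P_B share no hit, which is the situation the domain A and B example describes.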
A historic moment for the scientific endeavor.



THE HUMAN GENOME

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the 
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two
ventures, one public and one private. That characterization detracts from 
the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago reflected, and now rewards, 
the confidence of those who believe that the pursuit of large-scale funda-
mental problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was believed possible. 
Thus, we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and 
private entrepreneurship. 

There are excellent scientific reasons for applauding an outcome that
has given us two winners. Two sequences are better than one; the opportunity for comparison and con-
vergence is invaluable. Indeed, a real-world proof of the importance of access to both sets of data can
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the sequencing of the human
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in 
his Viewpoint (p. 1257), the knowledge that all of the genetic components of any process can be 
identified will give extraordinary new power to scientists. Because of this breakthrough, research 
can evolve from analyzing the effects of individual genes to a more integrated view that examines 
whole ensembles of genes as they interact to form a living human being. Several articles in this issue 
highlight how this approach is already beginning to revolutionize the way we look at human disease.
This has been a massive project, on a scale unparalleled in the history of biology, but of course
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark 
announcement falls during the week of the anniversary of the birth of Charles Darwin. Darwin's
message that the survival of a species can depend on its ability to evolve in the face of change is 
peculiarly pertinent to discussions that have gone on in the past year over access to the Celera data. 
(Full information regarding the agreements that were reached to make the data available can be 
found at www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to be flexible,
allowing data repositories other than the traditional GenBank, while insisting on access to all the 
data needed to verify conclusions. In this domain, change is everywhere: Commercial researchers 
are producing more and more potentially valuable sequences, yet (at least in the United States) 
laws governing databases provide scant protection against piracy. Had the Celera data been kept se-
cret, it would have been a serious loss to the scientific community. We hope that our adaptability in
the face of change will enable other proprietary data to be published after peer review, in a way that 
satisfies our continuing commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully watched, has created 
new challenges for the scientific venture. Science is proud to have played a role in bringing this 
discovery onto the public stage. It is literally true that this is a historic moment for the scientific en-
deavor. The human genome has been called the Book of Life. Rather, it is a library in which, with
rules that encourage exploration and reward creativity, we can find many of the books that will 
help define us and our place in the great tapestry of life. 

Barbara R. Jasny and Donald Kennedy
Sbjct: 59112 aatggcttatttgtcactccagtgcctgtgcttgcagcacagcgtgattattgctccaag 59053

Query: 541   aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 600
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 59052 aatgaaattgaacactgcctgtgctctaaccttggggtcacaagcctggcttgtgatgac 58993

Query: 601   aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 660
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58992 aggaggccaaacagcatttgccagttggttctggcatggcttggaatggggagtgatcta 58933

Query: 661   agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 720
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58932 agtcttattatactgtcatatattttgattctgtactctgtacttagactgaactcagct 58873

Query: 721   gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 780
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58872 gaagctgcagccaaggccctgagcacttgtagttcacatctcaccctcatccttttcttt 58813

Query: 781   tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 840
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58812 tacactattgttgtagtgatttcagtgactcatctgacagagatgaaggctactttgatt 58753

Query: 841   ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttac 900
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58752 ccagttctacttaatgtgttgcacaacatcatccccccttccctcaaccctacagtttat 58693

Query: 901   gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 960
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 58692 gcacttcagaccaaagaacttagggcagccttccaaaaggtgctgtttgcccttacaaaa 58633

Query: 961   gaaataagatcttag 975
             |||||||||||||||
Sbjct: 58632 gaaataagatcttag 58618



